Text Precis Utility

Functional Specification

Version 0.2
20 Nov 2002
Hugh Miles (hugh_miles@users.sourceforge.net)

1. Introduction

The utility creates an automatic summary of a text passage by using its internal references. This project stems from a talk I once heard on natural language processing. The algorithm used in this utility was one of the examples given in that talk.

A text passage that runs to several sentences will contain implicit references because nouns and verbs are used in more than one sentence. Each sentence has a number of implicit forward references and a number of implicit backward references. A passable summary can be made by listing the three sentences with the best balance of forward and backward references.

1.1 Sentence analysis

The text is separated into sentences. A technique as simple as looking for a full stop (period), question mark or exclamation mark is sufficient.
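As a rough illustration of the splitting step, here is a minimal Python sketch. The function name split_sentences is hypothetical and not taken from the prototype, and the sketch assumes that every full stop, question mark or exclamation mark followed by white space ends a sentence, so abbreviations such as "e.g." will be split incorrectly.

```python
import re


def split_sentences(text):
    """Naively split text into sentences.

    Assumption: a sentence ends at '.', '?' or '!' followed by white space.
    Abbreviations and decimal numbers are not handled.
    """
    # The lookbehind keeps the terminating punctuation with its sentence.
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [part for part in parts if part]
```

For example, split_sentences("Linux is free. Is it good? Yes!") gives the three sentences "Linux is free.", "Is it good?" and "Yes!".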

1.2 Term analysis

Each sentence is analysed for the terms used in it. The algorithm needs some intelligence here to recognise variants of the same term: "Linux's" and "Linux", or "developer", "developer's" and "developers". There is also a list of "noise" terms that are disregarded: "until", "had", "a", "in", "if", "you", "is", etc.
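The term analysis might be sketched in Python as follows. This is an illustration only: the name extract_terms and the small NOISE set (just the examples quoted above) are assumptions, and the only variant handled here is the possessive "'s". Plural forms would need further rules.

```python
import re

# Assumption: a tiny sample of noise terms, copied from the examples above.
NOISE = {"until", "had", "a", "in", "if", "you", "is"}


def extract_terms(sentence, noise=NOISE):
    """Return the set of distinct, lower-cased terms in a sentence."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
    terms = set()
    for word in words:
        # Treat "Linux's" and "Linux" as the same term.
        if word.endswith("'s"):
            word = word[:-2]
        if word and word not in noise:
            terms.add(word)
    return terms
```

With this sketch, extract_terms("Linux's kernel is in Linux") yields the two terms "linux" and "kernel".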

1.3 Counting references

Count the forward and backward references for each sentence. A sentence has a backward reference if it uses a term used in an earlier sentence. A sentence has a forward reference if it uses a term used in a later sentence.

Not surprisingly, the earlier sentences have a preponderance of forward references; the later sentences have a preponderance of backward references.
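The counting step could look something like the Python sketch below. It assumes each sentence has already been reduced to a set of terms, and it assumes one reference is counted per distinct term shared with any earlier (or later) sentence. The specification does not say whether repeated uses count more than once, so this is one possible reading. The name count_references is hypothetical.

```python
def count_references(term_sets):
    """Return a (forward, backward) pair of counts for each sentence.

    term_sets is a list with one set of terms per sentence, in text order.
    Assumption: each distinct shared term counts as one reference.
    """
    counts = []
    for index, current in enumerate(term_sets):
        earlier = set().union(*term_sets[:index])
        later = set().union(*term_sets[index + 1:])
        forward = len(current & later)
        backward = len(current & earlier)
        counts.append((forward, backward))
    return counts
```

For the three term sets {"linux", "kernel"}, {"kernel", "developer"} and {"developer", "linux"}, the result is [(2, 0), (1, 1), (0, 2)], which shows the first sentence dominated by forward references and the last by backward references.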

1.4 Selection

Select the sentences with the highest mean of forward and backward reference counts and the lowest standard deviation of those two counts from their mean.

Generate the summary by taking the three highest ranked sentences and listing them in the order they appear in the text.
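The selection step might be sketched in Python as below. The specification does not say how the mean and the standard deviation are combined into one ranking, so this sketch assumes the mean is the primary key (highest first) and the standard deviation breaks ties (lowest first). The name select_summary and the size parameter are hypothetical.

```python
from statistics import mean, pstdev


def select_summary(sentences, counts, size=3):
    """Pick the best-ranked sentences and return them in text order.

    counts holds one (forward, backward) pair per sentence.
    Assumption: rank by highest mean, then by lowest standard deviation.
    """
    def rank_key(index):
        pair = counts[index]
        # Negate the mean so that an ascending sort puts high means first.
        return (-mean(pair), pstdev(pair), index)

    ranked = sorted(range(len(sentences)), key=rank_key)
    chosen = sorted(ranked[:size])  # restore the original text order
    return [sentences[index] for index in chosen]
```

With only two counts per sentence, the standard deviation is half the absolute difference between them, so a sentence with counts (2, 2) outranks one with counts (4, 0) even though both have a mean of 2.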

2. Interface

The prototype reads text from the standard input and writes the summary to the standard output.

    Summit.exe [-d] < input file > output file

    -d    Print diagnostic output
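For illustration, the same interface could be expressed in Python roughly as follows. This is not the prototype's code. The function run and its summarise parameter are hypothetical, and writing the diagnostics to the standard error stream is an assumption, since the specification does not say where diagnostic output goes.

```python
import argparse
import sys


def run(argv, stdin, stdout, stderr, summarise):
    """Read text from stdin and write one summary sentence per line to stdout.

    summarise is any callable that maps the input text to a list of sentences.
    """
    parser = argparse.ArgumentParser(prog="summit")
    parser.add_argument("-d", dest="diagnostics", action="store_true",
                        help="print diagnostic output")
    args = parser.parse_args(argv)

    text = stdin.read()
    if args.diagnostics:
        # Assumption: diagnostics go to stderr so the summary stays clean.
        print(f"read {len(text)} characters", file=stderr)
    for sentence in summarise(text):
        print(sentence, file=stdout)
    return 0
```

A script would typically call run(sys.argv[1:], sys.stdin, sys.stdout, sys.stderr, summarise) with whatever summarising function is available.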

3. Configuration

The prototype reads a list of "noise" terms from a file named .summitrc. The CVS repository contains the following default list:

a and another any arev as at
b because been but by c d
e f finally for from g get
h had have however i if in
is it its j k l m
make most n no none o of
on one or other p q r
s so some such t that the
these to u unless until v w
was were when will with would x
y you your z
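Reading such a list is straightforward. The Python sketch below assumes the file simply holds terms separated by spaces or line breaks, as the listing above suggests. The actual file layout is not specified here, and the function names are hypothetical.

```python
from pathlib import Path


def parse_noise_terms(text):
    """Return the set of lower-cased noise terms found in text.

    Assumption: terms are separated by any amount of white space.
    """
    return set(text.lower().split())


def load_noise_terms(path=".summitrc"):
    """Load the noise terms from the configuration file at path."""
    return parse_noise_terms(Path(path).read_text(encoding="utf-8"))
```

Holding the terms in a set makes the test "is this term noise?" a single membership check.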

4. Internationalization

There is no provision for internationalization in the prototype. The language used is British English.

5. Term matching

The prototype matches terms using the following rules.

Type          Rule                                               Example
ies plurals   xies -> xie                                        vies -> vie
ies plurals   xxxies -> xxxy                                     curries -> curry
es plurals    xxxes -> xxxe                                      gates -> gate
s plurals     xxxCs -> xxxC (C is a consonant other than 's')    rocks -> rock
ly adverbs    xxxly -> xxx                                       cowardly -> coward
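These rules might be applied in Python along the following lines. This is a sketch rather than the prototype's implementation. It assumes the rules are tried in the order listed, that "x" stands for exactly one letter and "xxx" for one or more letters, and that "y" counts as a consonant. The name normalise is hypothetical.

```python
VOWELS = set("aeiou")  # assumption: 'y' is treated as a consonant


def normalise(term):
    """Reduce a lower-case term to its base form using the rules above."""
    if term.endswith("ies"):
        stem = term[:-3]
        if len(stem) == 1:
            return stem + "ie"      # vies -> vie
        if len(stem) > 1:
            return stem + "y"       # curries -> curry
        return term
    if term.endswith("es") and len(term) > 2:
        return term[:-1]            # gates -> gate
    if term.endswith("s") and len(term) > 2:
        before = term[-2]
        if before.isalpha() and before not in VOWELS and before != "s":
            return term[:-1]        # rocks -> rock
    if term.endswith("ly") and len(term) > 2:
        return term[:-2]            # cowardly -> coward
    return term
```

Terms ending in a double "s", such as "glass", are left unchanged by this sketch, as are two-letter words such as "is".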
