Version | 0.2 |
Date | 20 Nov 2002 |
Author | Hugh Miles (hugh_miles@users.sourceforge.net) |
Create an automatic summary of a text passage by using its internal references. This project stems from a talk I once heard on natural language processing. The algorithm used in this utility was one of the examples.
A text passage that runs to several sentences will contain implicit references because nouns and verbs are used in more than one sentence. Each sentence has a number of implicit forward references and a number of implicit backward references. A passable summary can be made by listing the three sentences with the best balance of forward and backward references.
The text is separated into sentences. A technique as simple as looking for a full stop (period), question mark or exclamation mark is enough.
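The splitting step can be sketched in a few lines of Python. This is only an illustration, not the prototype's code; the function name `split_sentences` is made up for the example, and it assumes a terminator is always followed by whitespace or the end of the text.

```python
import re

def split_sentences(text):
    """Split text after '.', '?' or '!' when followed by whitespace.

    Hypothetical helper: abbreviations such as "e.g." are split too,
    which is the price of so simple a rule.
    """
    parts = re.split(r"(?<=[.?!])\s+", text.strip())
    return [p for p in parts if p]

print(split_sentences("Linux is free. Is it good? Yes!"))
```

The terminator stays attached to its sentence, so the selected sentences can be printed unchanged in the summary.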
Each sentence is analysed for the terms used in it. Here the algorithm requires a bit of intelligence to spot variants: "Linux's" and "Linux", or "developer", "developer's" and "developers". Also, there's a list of "noise" terms which should be disregarded: "until", "had", "a", "in", "if", "you", "is", etc.
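A minimal sketch of the term analysis follows, assuming lower-casing and stripping a trailing possessive are enough to merge "Linux's" with "Linux". The names `NOISE` and `terms` are invented for the example, and the noise set is only the short excerpt quoted above, not the full default list.

```python
import re

# Excerpt of the noise list quoted in the text; the real list is longer.
NOISE = {"until", "had", "a", "in", "if", "you", "is"}

def terms(sentence, noise=NOISE):
    """Return the set of significant terms in one sentence (illustrative)."""
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
    result = set()
    for word in words:
        if word.endswith("'s"):
            word = word[:-2]  # "linux's" -> "linux"
        if word and word not in noise:
            result.add(word)
    return result

print(terms("Linux's kernel is in Linux"))
```

A set is used because only the presence of a term in a sentence matters here, not how often it occurs.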
Count the forward and backward references for each sentence. A sentence has a backward reference if it uses a term used in an earlier sentence. A sentence has a forward reference if it uses a term used in a later sentence.
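The counting step might look like the sketch below. The text does not say exactly what is counted, so this assumes one reference per distinct term that a sentence shares with any earlier (or later) sentence; `count_references` is a hypothetical name.

```python
def count_references(term_sets):
    """Return a (forward, backward) pair for each sentence.

    term_sets holds one set of terms per sentence, in document order.
    Assumption: a term counts once per direction, however many other
    sentences use it.
    """
    counts = []
    for i, current in enumerate(term_sets):
        earlier = set().union(*term_sets[:i])
        later = set().union(*term_sets[i + 1:])
        counts.append((len(current & later), len(current & earlier)))
    return counts

example = [{"linux", "kernel"}, {"kernel", "developer"}, {"developer", "linux"}]
print(count_references(example))  # [(2, 0), (1, 1), (0, 2)]
```

In this small example the first sentence has only forward references and the last has only backward references, while the middle sentence is balanced.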
Not surprisingly, the earlier sentences have a preponderance of forward references; the later sentences have a preponderance of backward references.
Select the sentences with the highest mean of the forward and backward reference counts and the lowest standard deviation of those two counts from the mean.
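The text does not say how the two criteria are combined into a single ranking. The sketch below assumes the simplest combination, the mean minus the (population) standard deviation; this is a guess made for illustration, and `score` and `rank` are hypothetical names.

```python
from statistics import mean, pstdev

def score(forward, backward):
    """Assumed score: high mean is good, high spread is bad."""
    counts = [forward, backward]
    return mean(counts) - pstdev(counts)

def rank(counts):
    """Return sentence indices ordered from best to worst score."""
    return sorted(range(len(counts)), key=lambda i: score(*counts[i]), reverse=True)

print(rank([(2, 0), (1, 1), (0, 2)]))  # [1, 0, 2]
```

With only two counts, this particular score reduces to the smaller of the two, so a sentence ranks highly only when it has plenty of references in both directions.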
Generate the summary by taking the three highest-ranked sentences and listing them in the order in which they appear in the text.
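As a sketch of this final step, assuming each sentence already has a numeric score: pick the best three, then restore document order. The function name `summarise` and its `size` parameter are invented for the example.

```python
def summarise(sentences, scores, size=3):
    """Return the `size` best-scoring sentences in document order."""
    ranked = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)
    chosen = sorted(ranked[:size])  # back to the order of appearance
    return [sentences[i] for i in chosen]

print(summarise(["A.", "B.", "C.", "D."], [0.5, 3.0, 1.0, 2.0]))
# ['B.', 'C.', 'D.']
```

Sorting the chosen indices is what keeps the summary readable: the sentences appear in the same sequence as in the source text, whatever their rank.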
The prototype reads text from the standard input and writes the summary to the standard output.
Summit.exe [-d] < input file > output file |
-d | Print diagnostic output |
The prototype reads a list of "noise" terms from a file named .summitrc. The CVS repository contains the following default list:
a | and | another | any | are | as | at |
b | because | been | but | by | c | d |
e | f | finally | for | from | g | get |
h | had | have | however | i | if | in |
is | it | its | j | k | l | m |
make | most | n | no | none | o | of |
on | one | or | other | p | q | r |
s | so | some | such | t | that | the |
these | to | u | unless | until | v | w |
was | were | when | will | with | would | x |
y | you | your | z |
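The layout of .summitrc is not described here. The sketch below assumes the terms are simply separated by whitespace, and that a missing file means an empty noise list; both are assumptions, and `parse_noise_terms` and `load_noise_terms` are hypothetical names.

```python
from pathlib import Path

def parse_noise_terms(text):
    """Assumed format: terms separated by spaces or newlines."""
    return {word.lower() for word in text.split()}

def load_noise_terms(path=".summitrc"):
    """Read the noise list, or return an empty set if the file is absent."""
    file = Path(path)
    if not file.is_file():
        return set()
    return parse_noise_terms(file.read_text(encoding="utf-8"))

print(sorted(parse_noise_terms("a and\nThe")))  # ['a', 'and', 'the']
```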
There is no provision for internationalization in the prototype. The language used is British English.
The prototype matches terms using the following rules.
Rule | Pattern | Example |
ies plurals | xies -> xie | vies -> vie |
ies plurals | xxxies -> xxxy | curries -> curry |
es plurals | xxxes -> xxxe | gates -> gate |
s plurals | xxxCs -> xxxC where C is a consonant other than 's' | rocks -> rock |
ly adverbs | xxxly -> xxx | cowardly -> coward |
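The rules in the table can be sketched as a single function. This is an illustrative reading of the table rather than the prototype's code: it assumes the rules are tried from top to bottom, that "x" stands for exactly one letter and "xxx" for one or more, and that "y" counts as a consonant. The name `normalise` is made up.

```python
def normalise(term):
    """Reduce a term to its base form using the table's rules (sketch)."""
    t = term.lower()
    if t.endswith("ies"):
        stem = t[:-3]
        if len(stem) == 1:
            return stem + "ie"   # vies -> vie
        if len(stem) > 1:
            return stem + "y"    # curries -> curry
        return t
    if t.endswith("es") and len(t) > 2:
        return t[:-1]            # gates -> gate
    if t.endswith("s") and len(t) > 1:
        before = t[-2]
        if before.isalpha() and before not in "aeious":
            return t[:-1]        # rocks -> rock
    if t.endswith("ly") and len(t) > 2:
        return t[:-2]            # cowardly -> coward
    return t

for word in ["vies", "curries", "gates", "rocks", "cowardly", "glass"]:
    print(word, "->", normalise(word))
```

Rules this simple over-reduce some words ("only" becomes "on"), which matters little as long as every variant of a term is reduced to the same form.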