Text Precis Utility

Functional Specification

Version	0.2
Date	20 Nov 2002
Author	Hugh Miles (hugh_miles@users.sourceforge.net)

1. Introduction

This utility creates an automatic summary of a text passage by using its internal references. The project stems from a talk I once heard on natural language processing, in which the algorithm used here was one of the examples.

A text passage that runs to several sentences will contain implicit references because nouns and verbs are used in more than one sentence. Each sentence has a number of implicit forward references and a number of implicit backward references. A passable summary can be made by listing the three sentences with the best balance of forward and backward references.

1.1 Sentence analysis

The text is separated into sentences. A simple technique suffices: treat each full stop (period), question mark or exclamation mark as the end of a sentence.
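
The splitting step might be sketched in Python as follows. This is an illustration only, not the prototype's code; the function name split_sentences is hypothetical, and the sketch makes no attempt to handle abbreviations such as "e.g.", which it would wrongly treat as sentence ends.

```python
import re

def split_sentences(text):
    """Split text into sentences at '.', '?' or '!' (naive: no abbreviation handling)."""
    # Each match is a run of non-terminator characters plus any terminators that follow.
    pieces = re.findall(r"[^.?!]+[.?!]*", text)
    # Drop surrounding whitespace and discard empty fragments.
    return [piece.strip() for piece in pieces if piece.strip()]
```

For example, split_sentences("Linux is free. Is it good? Yes!") yields three sentences, each keeping its closing punctuation.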

1.2 Term analysis

Each sentence is analysed for the terms used in it. Here the algorithm requires a bit of intelligence to spot variants of the same term: "Linux's" and "Linux", or "developer", "developer's" and "developers". In addition, a list of "noise" terms is disregarded: "until", "had", "a", "in", "if", "you", "is", etc.
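
A minimal sketch of term extraction, in Python, is given below. The names NOISE and extract_terms are hypothetical, the noise set is only the handful of examples quoted above, and the handling of the possessive "'s" is my assumption about how "Linux's" and "Linux" are made to match. Plurals are left alone here; they fall under the rules in section 5.

```python
import re

# Illustrative subset of the noise list; the real list is longer.
NOISE = {"until", "had", "a", "in", "if", "you", "is"}

def extract_terms(sentence, noise=NOISE):
    """Return the set of lower-cased terms in a sentence, minus noise terms."""
    # A word is a run of letters, optionally followed by an apostrophe suffix.
    words = re.findall(r"[a-z]+(?:'[a-z]+)?", sentence.lower())
    terms = set()
    for word in words:
        # Assumption: drop a trailing possessive so "linux's" matches "linux".
        if word.endswith("'s"):
            word = word[:-2]
        if word not in noise:
            terms.add(word)
    return terms
```

Returning a set means a term repeated within one sentence is recorded only once for that sentence.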

1.3 Counting references

Count the forward and backward references for each sentence. A sentence has a backward reference if it uses a term used in an earlier sentence. A sentence has a forward reference if it uses a term used in a later sentence.

Not surprisingly, the earlier sentences have a preponderance of forward references; the later sentences have a preponderance of backward references.
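
The counting step could be sketched as below. The function name count_references is hypothetical, and the sketch assumes that a reference is counted once per distinct term shared with any later (forward) or any earlier (backward) sentence; the specification does not say whether repeated matches count more than once.

```python
def count_references(term_sets):
    """For each sentence, return a (forward, backward) pair of reference counts.

    term_sets is a list of sets of terms, one set per sentence, in text order.
    Assumption: each distinct shared term counts as one reference.
    """
    counts = []
    for i, terms in enumerate(term_sets):
        earlier = set().union(*term_sets[:i])      # terms seen before sentence i
        later = set().union(*term_sets[i + 1:])    # terms seen after sentence i
        counts.append((len(terms & later), len(terms & earlier)))
    return counts
```

Given the three term sets {linux, kernel}, {kernel, developer} and {developer, linux}, the result is [(2, 0), (1, 1), (0, 2)]: the first sentence has only forward references and the last only backward ones.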

1.4 Selection

Select the sentences with the highest mean of forward and backward references and the lowest standard deviation of the forward and backward counts from that mean.

Generate the summary by taking the three highest ranked sentences and listing them in the order they appear in the text.
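
One way to turn the selection rule into code is sketched below. The rule names two measures but does not say how they are combined, so the score used here (mean minus population standard deviation) is purely my assumption, as is the function name select_summary.

```python
from statistics import mean, pstdev

def select_summary(sentences, counts, size=3):
    """Pick the `size` best-ranked sentences and return them in text order.

    counts[i] is the (forward, backward) pair for sentences[i].
    Assumption: score = mean - population standard deviation of the pair.
    """
    def score(i):
        pair = counts[i]
        return mean(pair) - pstdev(pair)

    # Rank sentence indices by descending score; ties keep their text order.
    ranked = sorted(range(len(sentences)), key=score, reverse=True)
    # Re-sort the chosen indices so the summary follows the original order.
    chosen = sorted(ranked[:size])
    return [sentences[i] for i in chosen]
```

A high mean rewards sentences that are well connected, while subtracting the deviation penalises those whose references run mostly in one direction.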

2. Interface

The prototype reads text from the standard input and writes the summary to the standard output.

Summit.exe [-d] < input file > output file
-d	Print diagnostic output

3. Configuration

The prototype reads a list of "noise" terms from a file named .summitrc. The CVS repository contains the following default list:

a	and	another	any	are	as	at
b	because	been	but	by	c	d
e	f	finally	for	from	g	get
h	had	have	however	i	if	in
is	it	its	j	k	l	m
make	most	n	no	none	o	of
on	one	or	other	p	q	r
s	so	some	such	t	that	the
these	to	u	unless	until	v	w
was	were	when	will	with	would	x
y	you	your	z
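
Reading such a list might look like the sketch below. It assumes the file holds terms separated by any whitespace (tabs or newlines), as the listing above suggests, and that the path is supplied by the caller; the function name load_noise_terms is hypothetical and the prototype's actual file handling may differ.

```python
def load_noise_terms(path=".summitrc"):
    """Read noise terms from a file; terms are assumed to be whitespace-separated."""
    with open(path, encoding="utf-8") as handle:
        # split() with no argument breaks on tabs, spaces and newlines alike.
        return {term.lower() for term in handle.read().split()}
```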

4. Internationalization

There is no provision for internationalization in the prototype. The language used is British English.

5. Term matching

The prototype matches terms using the following rules.

ies plurals	xies -> xie	vies -> vie
ies plurals	xxxies -> xxxy	curries -> curry
es plurals	xxxes -> xxxe	gates -> gate
s plurals	xxxCs -> xxxC where C is a consonant other than 's'	rocks -> rock
ly adverbs	xxxly -> xxx	cowardly -> coward
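
The rules in the table can be sketched in Python as follows. This is my reading of the table rather than the prototype's code: it assumes "xies" means exactly one letter before "ies", that "xxx" stands for any prefix, that at most one rule is applied per term, and that "y" counts as a consonant. The name normalise is hypothetical, and short words are not special-cased.

```python
VOWELS = set("aeiou")

def normalise(term):
    """Reduce a lower-case term to a base form using the table's rules."""
    if term.endswith("ies"):
        # Four-letter words: "vies" -> "vie"; longer words: "curries" -> "curry".
        return term[:-1] if len(term) == 4 else term[:-3] + "y"
    if term.endswith("es"):
        return term[:-1]                  # "gates" -> "gate"
    if len(term) > 1 and term.endswith("s"):
        before = term[-2]
        # Strip the "s" only after a consonant other than "s".
        if before not in VOWELS and before != "s":
            return term[:-1]              # "rocks" -> "rock"
    if term.endswith("ly"):
        return term[:-2]                  # "cowardly" -> "coward"
    return term
```

Two terms then match when their normalised forms are equal, so "developers" and "developer" both reduce to "developer", while "class" is left unchanged.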