How does the abstractor work? This document describes some of the techniques used in the new HTML summariser.
Discussion
|
One of the principle components is Discussion Flow Analysis (DFA), a technique developed at Software Scientific's research laboratories, which evaluates the relevance of some text to a query or trigger. In essence, it measures how closely the flow of discussion in the text mirrors the flow of discussion in the original query. |
|
All words significant
|
Unlike statistical techiques, DFA takes note of all the words in the text being measured, including those regarded by many techniques as 'noise', such as 'and', and 'of'. DFA also uses the order in which words appear, just like people do. Furthermore, DFA is aware of the fact that most words can represent more than one concept, and that any one concept is also generally represented by more than one word. (Especially if you're working in more than one language simultaneously, as DFA can.) |
| Software Scientific uses DFA (in conjunction with its concept analysis) in its relevance ranking system more or less directly. This is seen most readily in the application 'Inquisitor'. | |
AbstractingDynamic shrink and grow |
How does this apply to abstracting? Well, Software Scientific's abstractors attempt to select the n sentences which, when read together, result in the higest DFA relevance. Since DFA uses all the words, this means that when expanding a summary from (say) two sentences long to three, it does not necessarily follow that both of the original sentences will be in the three sentence abstract. This makes sense from an information point of view, since the abstract you would choose to write using three sentences is not neccessarily a two sentence abstract plus another sentence. |
|
Heuristics |
Astute readers will notice that choosing the n best sentences out of a document m sentences long is an order m2 task. Because of this, we use heuristics to reduce the task to order m. It can be shown that the error thereby introduced is less than the intrinsic error of DFA, so the summary results are no worse. |
ParsingNatural language
|
Of course, this is not the entire problem. One also has to parse the original sentences. To do this we use a finite state automaton which can parse both natural language and HTML simultaneously with the natural langauge elements as a self-inducting (and therefore language independent) LL1 grammar. This means that, although heavily recursive, no back-tracking is ever needed, so parsing is fast. In addition, the parsing is able to strength-reduce sentences, increasing the information density of any resulting abstract. |
Summary (!)HTML structure-aware |
The upshot of all this is that a summariser based on DFA, and our HTML /
natural language simultaneous parser/strength reducer is much more accurate
than a dumb 'count the words' abstractor. It also allows the summarising of
structures in HTML, such as tables, lists, and images. (The relevance of an
image is determined by its textual context.) Simply put, the abstracts are
tighter, preserve the 'look and feel' of the original, and are just plain
better.
[ We know, we wrote it! - Ed.] |
P.S. |
The abstractor is about 15,000 lines of tight AI code, mostly written by humans, though with some code 'evolved' in genetic algorithms, compiling in all to about 200k of machine code. |