To err is human, but to really foul things up requires a
computer.
Farmers' Almanac, 1978
Unlike many companies, Software Scientific is not a one-technology shop. Instead, we will deploy whatever technique is appropriate, often combining many to offset the weaknesses of one approach with the strengths of another.
Our APIs remain focused on the task in hand, rather than the mechanics of accomplishing that task.
Statistical systems
At this time, there are no known statistical methods for achieving the
functionality which is available to Lectern users (the underlying technology is the
Concept Engine). It is difficult to envisage how statistical methods could be
extended to incorporate the functionality required.
Despite these limitations, statistical methods have been applied to some of the functions offered by the Concept Engine. On the Internet, for instance, some search engines are starting to offer summaries. Our explicit claim, however, is that the Concept Engine produces better results than statistical techniques, and simple tests demonstrate this readily. Statistical methods operate by calculating word frequencies and correlating associated words; this corresponds poorly to the way people use language, so the results are erratic. The Concept Engine understands the meaning of words and gives more reliable results when summarising text.
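To make the statistical approach concrete, here is a minimal sketch of a word-frequency summariser of the kind described above. It is an illustration only: the function name `summarise`, the stopword list, and the scoring rule are assumptions made for this example and are not taken from any particular search engine or from the Concept Engine.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stopword list (an assumption for this sketch).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "it", "that", "on", "for"}


def tokenise(text):
    """Lower-case the text and return its words, minus stopwords."""
    return [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOPWORDS]


def summarise(text, max_sentences=1):
    """Return the highest-scoring sentences, kept in their original order.

    A sentence's score is the average document-wide frequency of its words.
    Nothing here models meaning: a sentence and its negation score almost
    identically.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    freq = Counter(tokenise(text))

    def score(sentence):
        tokens = tokenise(sentence)
        return sum(freq[w] for w in tokens) / len(tokens) if tokens else 0.0

    # Rank sentence indices by score, then restore document order.
    ranked = sorted(range(len(sentences)), key=lambda i: score(sentences[i]), reverse=True)
    chosen = sorted(ranked[:max_sentences])
    return " ".join(sentences[i] for i in chosen)
```

Because the score depends only on word counts, the sentence chosen can change when an unrelated word happens to be repeated elsewhere in the document, which is one source of the erratic behaviour noted above.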
Rule based
Another approach that might seem reasonable at first sight is the
Computational Linguistic approach, which uses many rules to 'understand'
the text.
Such rule-based systems are hardwired, and they rely on numerous built-in exceptions to cope with the anomalies of language. Their core functionality becomes increasingly unreliable as more sophisticated demands are placed on it. Furthermore, rule-based methods usually rely on large dictionaries of noun and verb data; this makes summarising quite slow and not readily extendable to multiple languages.
In the following tables, our technology is in italics.

| | Speed | Multi-lingual | Can be query focused? | Quality |
|---|---|---|---|---|
| Discussion flow analysis | Fast | Yes | Yes | "Indicative" abstracts |
| Statistical | Fast | Yes | No | Poor |
| Deep grammars | Slow | No | No | Can invert meaning by accident |

| | Speed | Parallel multi-lingual | "Find Similar" | Quality | Meaning aware |
|---|---|---|---|---|---|
| Discussion flow analysis | Fast | Yes | Yes | Excellent | Yes |
| 'Word Bag' approach | Fast | No | No | Poor | No |
| Statistical | Fast | No | No | Fair | No |

| | Speed | Accuracy | Machine overhead | Staff overhead | Relative cost |
|---|---|---|---|---|---|
| Concept Engine | Fast | High | Low | Low | Low |
| Manual | Very slow | High for expert users | None | Experts only | High |

| | Speed | Relevance ranking | Images | Overhead |
|---|---|---|---|---|
| Bullets | Fast | Good | Yes | Low (10-15%) |
| Boolean inverted index | Fast | Poor | No | High (50-100%) |
| Thesaurus inverted index | Fast | Fair | No | High (70-125%) |
| N.L. inverted index | Fast | Good | No | High (90-150%) |
| Discussion flow analysis | Slow | Excellent | No | None (0%) |
| Neural Net / Pattern Recognition | Medium | Very poor | Yes | Various |

| | Speed | Accuracy | Machine overhead | Staff overhead | Relative cost |
|---|---|---|---|---|---|
| Auto Detect | Fast | High | Low | Low | Low |
| N-grams | Fast | Medium | Low | Low | Low |
| Manual | Very slow | Very high | None | High | High |