Machine Intelligence
Frequently Asked Questions
These are questions that we are frequently asked by customers.
We will be adding to this page as more are asked!
If you have a question you would like added to this list,
please contact us and let us know.
Summariser Questions
What features are important in a summariser?
- When considering different summarisers,
both quality and performance are very significant,
so test each candidate against the type of documents you will be
summarising and check the quality of the summaries it produces.
- The summariser should be multi-lingual.
- To be useful in any type of searching environment, the summariser
must be able to focus the summary on the original search query.
This enables the 'drill-down' paradigm.
- Equally important is the ability to produce pure abstracts based on
the document author's original point of view.
- In the web or intranet environment, the summariser must be able to
understand HTML, so it correctly handles:
- tables
- images
- bullet lists
This means that it can reproduce the 'look-and-feel' of the original
document within the constraints of a summary. So, for example, if part of a
table is used in a summary, the relationships between rows
and columns are preserved, even if the entire table is not
used. Images are also assessed for relevance, and will
only appear if they add substantial value to the summary. Images should be
assessed using only the HTML, which preserves performance.
- The summariser should be designed to work well in the web
environment, and not require a 'stateful' connection,
and yet still provide local drill-down into the document.
- The summariser should also be designed to work alongside search
engines. For example, you should be able to give it the results
of a search, and have it produce a relevance-ranked table of
summaries.
- Text and HTML should be detected automatically based on document
content, not name or MIME type. This enables a summariser to detect
text documents which have been topped-and-tailed with HTML.
It should be able to recognise visual elements in text (such as underlines
made of hyphens or underbars, or headings set off by double line
feeds) and convert them to their HTML equivalents.
- Finally, the summariser should be highly configurable. For example:
- The appearance of the summariser should be controlled by a template
or similar mechanism.
- Images can be excluded
- Visual HTML elements can be suppressed
- The visual density can be controlled
- Forms can be ignored (the default) or not
- URLs are automatically transformed, if appropriate.
- Tables of contents in documents should be ignored, at least by default.
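The content-based detection and text-to-HTML conversion described in the list above can be illustrated with a short Python sketch. This is not the product's implementation. The tag list, the three-character underline rule, the choice of `<h2>`, and the function names `looks_like_html` and `text_to_html` are all assumptions made for the example, and the sketch does not attempt to spot text that has been topped-and-tailed with HTML.

```python
import html
import re

# Assumed heuristic: content is HTML if it contains a common structural tag.
# The file name and MIME type are deliberately never consulted.
_HTML_TAG = re.compile(
    r"<\s*(html|head|body|p|div|table|h[1-6]|ul|ol|br)\b[^>]*>", re.IGNORECASE
)
# Assumed rule: an underline is a line of three or more hyphens, underbars or '='.
_UNDERLINE = re.compile(r"^[-_=]{3,}\s*$")


def looks_like_html(content: str) -> bool:
    """Guess the format from the content alone."""
    return _HTML_TAG.search(content) is not None


def text_to_html(content: str) -> str:
    """Turn underlined lines into headings and other non-blank lines into paragraphs."""
    lines = content.splitlines()
    out = []
    i = 0
    while i < len(lines):
        line = lines[i].strip()
        following = lines[i + 1].strip() if i + 1 < len(lines) else ""
        if line and _UNDERLINE.match(following):
            out.append(f"<h2>{html.escape(line)}</h2>")
            i += 2  # skip the underline itself
        elif line:
            out.append(f"<p>{html.escape(line)}</p>")
            i += 1
        else:
            i += 1  # blank line: nothing to emit
    return "\n".join(out)
```

For example, `text_to_html("Title\n-----\nSome text")` yields a heading followed by a paragraph, while `looks_like_html` returns False for the same input because it contains no tags.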
How is a summariser constructed?
The main technique that we use is called discussion flow analysis.
What other techniques exist for writing a summariser?
The main alternatives are statistical systems, or those attempting deep parsing of
natural language. Statistical systems tend to be over-simplistic, whereas deep grammatical
systems either get it spectacularly right or even more spectacularly wrong!
Please see the technique comparison page.
What are the performance implications of the summariser?
Very few. The summariser does not use the techniques of natural language
parsers, which can require many minutes per sentence on a high performance workstation.
Consequently, the CPU load imposed by the summariser is much less than that of other
server components.
Concept Engine
What is the difference between Concept Engine and a traditional Search Engine?
The main differences, from a user's perspective, are:
- Concept Engine is about precision, not recall. Most web search engines over-retrieve.
That is, they return hundreds or thousands of documents, most of which have very little
relevance to the query. Although very good documents may be in the retrieval list, they
might as well not be, because the list is so long. In other words, the search engine has
very little precision. Concept Engine, on the other hand, focuses on precision: it may
occasionally miss a relevant document (though in practice it probably won't), but it won't
return irrelevant documents, so its precision is high. Consequently, the results from Concept
Engine are much more useful (many highly relevant documents, few irrelevant documents) than
the results of a search engine (many irrelevant documents, few relevant documents).
- Concept Engine uses natural language, rather than computer language. This makes it
much more useful to the end user, since they can focus on the job in hand, rather than
translating their requirements into a complex computer language. By contrast, most search
engines work best with 'boolean' searches, which are actually highly inappropriate for
searching text.
- Concept Engine allows highly specific queries, so that exactly the right documents are found.
- Concept Engine is about information research rather than text retrieval.
- As well as being able to search for documents, or documents about people (and hence
people via 'personal pages'), Concept Engine can search for images, specified
using natural language queries.
- Every page returned has been validated by Concept Engine, and will exist when you choose
to go there!
- Each page returned is properly summarised, with the summary being focused
on the original query.
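To make the precision and recall distinction in the list above concrete, here is a small Python sketch. The document counts are invented purely for illustration and are not measurements of Concept Engine or of any search engine. The function name `precision_recall` is likewise an assumption.

```python
def precision_recall(retrieved: set[str], relevant: set[str]) -> tuple[float, float]:
    """Return (precision, recall) for one result list.

    precision = relevant documents retrieved / all documents retrieved
    recall    = relevant documents retrieved / all relevant documents
    """
    hits = len(retrieved & relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


# Made-up example: 10 documents are genuinely relevant to the query.
relevant = {f"doc{i}" for i in range(10)}

# An over-retrieving engine returns 1000 documents, including all 10 relevant ones.
broad = {f"doc{i}" for i in range(1000)}

# A precision-focused engine returns only 8 documents, all of them relevant.
narrow = {f"doc{i}" for i in range(8)}

broad_precision, broad_recall = precision_recall(broad, relevant)
narrow_precision, narrow_recall = precision_recall(narrow, relevant)
```

In this invented case the broad list has perfect recall but only 1% precision, so the reader must wade through 990 irrelevant documents. The narrow list misses two relevant documents but every result it returns is useful.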
What are the performance implications of Concept Engine?
Concept Engine uses very little CPU; its speed is mostly limited by the latency of the web.
As a result, one server can run many Concept Engine queries simultaneously.
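The reason a latency-bound workload allows many simultaneous queries can be sketched in Python with `asyncio`. This is an illustration of the general principle only, not a description of how Concept Engine is built. The functions `run_query`, `run_many` and `timed_run` are hypothetical, and the 0.05 second sleep simply stands in for waiting on the network.

```python
import asyncio
import time


async def run_query(query: str, latency: float = 0.05) -> str:
    """Hypothetical query: almost all of its time is spent waiting, not computing."""
    await asyncio.sleep(latency)  # stand-in for web latency
    return f"results for {query!r}"


async def run_many(queries: list[str]) -> list[str]:
    """Run all queries concurrently; while one waits, the others make progress."""
    return await asyncio.gather(*(run_query(q) for q in queries))


def timed_run(queries: list[str]) -> tuple[list[str], float]:
    """Return the results and the wall-clock time taken to obtain them."""
    start = time.perf_counter()
    results = asyncio.run(run_many(queries))
    return results, time.perf_counter() - start
```

Running twenty such queries through `timed_run` takes roughly as long as a single query, rather than twenty times as long, because the waiting periods overlap and the CPU is idle for most of each one.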
General Questions
How do I write good queries for Software Scientific systems?
With the exception of ScriptSearch, all our technology uses natural language queries,
so there are no complex 'syntax' rules to be followed. However, there are still good
queries and bad queries, so we have provided a page to explain how to write an
effective query.