Tangled Webs
Searching for Meaning
May 21, 1997
Issue 2.7

How Many Hits!!??

Internet search engines are getting easier and easier to use but less and less useful. The Net is simply outgrowing them. This state of affairs was driven home to me again recently while I was researching a particular security hole in Windows NT. Most search engines returned over 3,000 documents that their algorithms deemed relevant. Only a handful of them actually were.

This information glut is not limited to technical subjects. Looking for critical analysis of Byron's Don Juan turned up anywhere from 250 to 2,900 pages -- far too many to be of use, and many with no connection to the poem.

We are drowning in data, yet search sites advertise by proclaiming how many millions of documents they have indexed. The number of web pages indexed is an easy metric to understand, but it is a poor measure of the search engines' worth. Finding out that 60,000 web pages match your search criteria is not helpful. I don't want more documents. I want the right documents.

In many cases, of course, it is possible to prune search results to a reasonable number of documents by using restrictive keywords and Boolean logic (the AND, OR, and NEAR operators). This works well when you are looking for a specific piece of information, but not when you want general information about a subject. These techniques invariably trim out a lot of useful documents. A new approach is called for.
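
For readers who have not used these operators, here is a minimal Python sketch of how AND, OR, and NEAR might be evaluated against a small positional word index. The three documents, the function names, and the NEAR window size are all assumptions made up for illustration; this is not a description of how any particular search engine is implemented.

```python
import re
from collections import defaultdict

# Hypothetical three-document collection, invented for illustration.
DOCS = {
    1: "a known nt security hole lets remote users read files",
    2: "buy our firewall today it supports windows nt and offers great security",
    3: "critical analysis of the poem don juan by byron",
}

def build_index(docs):
    """Map each word to {doc_id: [positions of that word in the document]}."""
    index = defaultdict(lambda: defaultdict(list))
    for doc_id, text in docs.items():
        for pos, word in enumerate(re.findall(r"[a-z0-9]+", text.lower())):
            index[word][doc_id].append(pos)
    return index

def docs_with(index, word):
    """Set of document ids that contain the word (empty if it is unknown)."""
    return set(index.get(word, {}))

def search_and(index, a, b):
    return docs_with(index, a) & docs_with(index, b)

def search_or(index, a, b):
    return docs_with(index, a) | docs_with(index, b)

def search_near(index, a, b, window=5):
    """Documents in which a and b occur within `window` words of each other."""
    hits = set()
    for doc_id in search_and(index, a, b):
        if any(abs(pa - pb) <= window
               for pa in index[a][doc_id]
               for pb in index[b][doc_id]):
            hits.add(doc_id)
    return hits

INDEX = build_index(DOCS)
print(sorted(search_or(INDEX, "security", "byron")))          # [1, 2, 3]
print(sorted(search_and(INDEX, "security", "nt")))            # [1, 2]
print(sorted(search_near(INDEX, "security", "nt", window=3))) # [1]
```

In this toy collection NEAR does drop the advertisement, which mentions NT and security several words apart. It would just as readily drop a useful document that happened to do the same, which is the weakness described above.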

The Next Generation
As a developer, I take particular pleasure in pontificating on how software I will never have to write should work. So let's look at how the ideal search engine should function.

The ideal search engine would act a lot like the ideal librarian. It would understand not only the content and quality of the documents in its library, but also the questions asked of it. When given general search terms, it would return documents containing general information on the topic rather than simply returning more documents. When asked for information on the known security holes in Windows NT, it most certainly should not direct the user to an advertisement that mentions NT security in passing.

This sort of search engine may sound like something that would require a HAL 9000, but in the last few years remarkable progress has been made in this direction. Eventually all search engines will have to work this way to some degree. Requiring users to learn increasingly complex Boolean logic to run simple searches is clearly not a viable long-term strategy.

Things to Come
Most of today's search engines are little more than word indexes. The engines understand neither the context in which the words occur, nor their meanings. When asked about spiders, the engines will miss documents that discuss arachnids, but find documents on how to program Internet robots, which are also called spiders. Likewise, they will not return a document containing the word Rome in a search for information on Italy or Europe.

Architext Software, creators of the Excite search engine, are making impressive attempts at overcoming these limitations. The Excite engine uses a thesaurus to eliminate homonyms and include synonyms in its searches, and uses the relative proximity of words to try to divine what documents are about. Any measure of success is necessarily subjective, but I find that Excite returns a higher percentage of relevant documents than word indexes like AltaVista.
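
The general idea of a thesaurus-assisted search can be sketched in a few lines of Python. I have no knowledge of Excite's internals, so everything here is an assumption for illustration: the thesaurus entry, the cue words, the sample documents, and the function names are all invented, and simple co-occurrence within a document stands in for a real measure of word proximity.

```python
import re

# Hypothetical thesaurus entry, invented for illustration. Each sense of an
# ambiguous word lists its synonyms plus cue words that suggest that sense.
THESAURUS = {
    "spider": {
        "animal": {"synonyms": {"arachnid", "tarantula"},
                   "cues": {"web", "leg", "venom"}},
        "software": {"synonyms": {"robot", "crawler"},
                     "cues": {"internet", "program", "index"}},
    }
}

# Hypothetical documents, also invented.
DOCS = {
    "zoo": "the tarantula is a large hairy arachnid",
    "garden": "a spider waits at the centre of its web",
    "howto": "how to program a spider that indexes internet pages",
}

def words_in(text):
    """Set of lowercase words in the text."""
    return set(re.findall(r"[a-z]+", text.lower()))

def plain_search(docs, word):
    """Plain word index: match only documents containing the literal word."""
    return sorted(name for name, text in docs.items() if word in words_in(text))

def sense_search(docs, word, sense):
    """Match synonyms of the chosen sense, and accept the ambiguous word
    itself only when a cue word for that sense appears in the same document."""
    entry = THESAURUS[word][sense]
    hits = []
    for name, text in docs.items():
        words = words_in(text)
        if words & entry["synonyms"]:
            hits.append(name)
        elif word in words and words & entry["cues"]:
            hits.append(name)
    return sorted(hits)

print(plain_search(DOCS, "spider"))              # ['garden', 'howto']
print(sense_search(DOCS, "spider", "animal"))    # ['garden', 'zoo']
print(sense_search(DOCS, "spider", "software"))  # ['howto']
```

The plain search misses the arachnid document and picks up the programming one. The sense-aware search fixes both problems, but only because someone has written the thesaurus entry by hand.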

The most impressive technology I have seen to date, however, is Oracle's ConText engine. To my knowledge, no one is yet using ConText to catalog the web, but it is widely used in conjunction with Oracle databases. ConText breaks down sentences grammatically to determine the function of the words and how they relate to one another. Like Architext, it uses a thesaurus to deal with homonyms and synonyms, but it takes the idea a step further. ConText uses a detailed concept database to determine what documents are actually about. It knows, for example, that Rome is in Italy and that Ethernet is used in networking computers, and it matches documents accordingly.
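
To show what matching by concept might look like, here is a toy Python sketch. It is not Oracle's implementation, and it does no grammatical analysis at all. The concept table, the sample documents, and the function names are assumptions invented for the example; a real concept database would be vastly larger.

```python
import re

# Hypothetical concept database: each concept points to its broader concept.
BROADER = {
    "rome": "italy",
    "italy": "europe",
    "ethernet": "networking",
    "networking": "computing",
}
KNOWN = set(BROADER) | set(BROADER.values())

# Hypothetical documents, invented for illustration.
DOCS = {
    "travel": "The Colosseum in Rome draws millions of visitors",
    "lan": "Most office computers are linked by Ethernet",
    "poem": "Byron wrote the poem Don Juan",
}

def ancestors(concept):
    """Return the concept followed by every broader concept above it."""
    chain = [concept]
    while chain[-1] in BROADER:
        chain.append(BROADER[chain[-1]])
    return chain

def topics(text):
    """Every concept a document touches, directly or by implication."""
    found = set()
    for word in re.findall(r"[a-z]+", text.lower()):
        if word in KNOWN:
            found.update(ancestors(word))
    return found

def search(docs, topic):
    """Names of documents whose implied topics include the requested one."""
    return sorted(name for name, text in docs.items()
                  if topic.lower() in topics(text))

print(search(DOCS, "Italy"))       # ['travel']
print(search(DOCS, "Europe"))      # ['travel']
print(search(DOCS, "networking"))  # ['lan']
```

The document about Rome is found in a search for Italy or Europe even though neither word appears in it, because the table records that Rome belongs to Italy and Italy to Europe.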

Of course, even the ConText engine only gives the illusion of understanding. It must ultimately draw its conclusions from the words contained in the documents. Much of human communication, however, involves metaphor and symbolism, and that is far beyond what even the most sophisticated concept database can handle. The best engines correctly classify articles on network security, but they haven't a clue what Don Juan is about.


© Copyright 1997, Tim Romero, t3@t3y.com
Tangled Webs may be distributed freely provided this copyright notice is included.
The Tangled Webs Archive is located at http://www.dotco.com/t3/tangledwebs/index.shtml