WebAnts(tm) Problem Definition

The problem of information discovery within the World Wide Web (the web) is vast. While it is impossible to get an completely accurate count of the number of documents available via the web (including ftp and gopher servers), it is easily in the millions. This is testified to by the fact that the Lycos search engine knows of approximately one million documents and its creator, Dr. Michael Mauldin, estimates that the current [12/94] number may be closer to five times that. And the web continues to grow. It is simply no longer practical for users to just wander the web when searching for something.

As a result, users have come to depend on search engines to help them find on-line resources. There are a number of different search engines available via the web, using a variety of methods to build their underlying databases. On one end of the spectrum are engines that rely entirely on individual servers providing self-indexing information, such as Martijn Koster's Aliweb. This approach has very much fallen out of vogue lately, due largely to its requirement of cooperation from every server. As many (and apparently most) server managers have not proven willing and/or able to make the required effort, databases produced by this method are invariably far from complete. At the other end of the spectrum are pro-active engines, which use software robots to index large portions of the web, such as Lycos. Engines of this class tend to have more complete databases, but even the most comprehensive (the aforementioned Lycos) does not nearly provide full indexing off the entire web. The reason for this is simple: resource constraints.

If discovering information in the web is a big problem, indexing the entire web is enormous. Not only must every document in the web be retrieved, but some portion of each document must be saved as a way of summarizing its contents for later retrieval. Different indexing schemes save different information from documents, but the problem facing every search engine is the same: how to manage such a vast amount of information. If, say, 1k is saved for each document, the resulting index exceeds a gigabyte. The trade-off becomes one of quality of index versus coverage of documents. Saving more information per document reduces the number of documents than can be covered, and vice versa. This trade-off, which is unfortunate in that it guarantees that a database will be in some way inadequate, either in scope or quality, is endemic to the current generation of search engines.


Back up to the WebAnts home page

Last updated 06-Feb-95 by John Leavitt (jrrl@cmu.edu)