WebAnts(tm) Progress Report -- May 1995

Cooperative Explorer Final Report

Work in April and May completed the development of the cooperative indexers. This report discusses the final design and implementation. The source code for these explorers is in the public domain and should be available shortly. A full technical report will also be made available.


Protocol

The following is the final version of the still unnamed explorer ant communication protocol.

Each communication act consists of a single transaction. A transaction consists of a correspond-ant sending a request to the account-ant and the account-ant sending back a response.

The following types of transaction are possible:
CLAIM
This request takes the form: "CLAIM url", where url is fully specified. This request asks the question "May I explore this document?" The possible responses are:
UNCLAIM
This request takes the form: "UNCLAIM url", where url is fully specified. This request says "I could not explore this document." The only possible response is:
ASSIGNMENT?
This request takes the form: "ASSIGNMENT?". This request asks the question "What document may I explore?" A correspond-ant may send such a request if it runs out of documents to explore. The following responses are possible:
EXCLUSION
This request takes the form: "EXCLUSION url exclusion-url", where url is the fully-specified URL of the document that led to the site in question, and exclusion-url is a fully-specified URL for a part of that site from which WebAnts are excluded. This request tells the account-ant that it should not let correspond-ants explore that part of the site. If exclusion-url is omitted, the request means that there are no further restrictions on that site; this form is usually used to report that the site has no restrictions at all. The only possible response is:
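
As an illustration of the four request forms listed above, the following is a minimal Python sketch of a request parser. It is not the Perl 4 implementation. It assumes one request per line, whitespace-separated fields, and URLs that contain no spaces; none of these details are specified in this report. The names Request and parse_request are hypothetical, and responses are deliberately not modeled.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Request:
    """One parsed request (hypothetical structure, for illustration only)."""
    kind: str                            # "CLAIM", "UNCLAIM", "ASSIGNMENT?" or "EXCLUSION"
    url: Optional[str] = None            # absent only for "ASSIGNMENT?"
    exclusion_url: Optional[str] = None  # used only by "EXCLUSION", and optional there


def parse_request(line: str) -> Request:
    """Parse one request line; raise ValueError if it is malformed.

    Assumption: fields are separated by whitespace and URLs contain no spaces.
    """
    parts = line.split()
    if not parts:
        raise ValueError("empty request")
    kind, args = parts[0], parts[1:]

    if kind in ("CLAIM", "UNCLAIM") and len(args) == 1:
        return Request(kind, url=args[0])
    if kind == "ASSIGNMENT?" and not args:
        return Request(kind)
    if kind == "EXCLUSION" and len(args) in (1, 2):
        # exclusion-url may be omitted, meaning "no further restrictions".
        exclusion_url = args[1] if len(args) == 2 else None
        return Request(kind, url=args[0], exclusion_url=exclusion_url)
    raise ValueError(f"malformed request: {line!r}")
```

For example, parse_request("EXCLUSION http://example.com/") yields a request whose exclusion_url is None, which corresponds to the "no further restrictions" case.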

Design

Two classes of ants are involved in exploration: a single account-ant and a series of correspond-ants. These classes' functions are:

Correspond-Ant
This ant handles all of the exploration. It explores documents, extracts summary information, and maintains a queue of documents to be explored. It communicates with the account-ant before exploring any document, in order to coordinate its efforts with its peers.
Account-Ant
This ant handles all of the cooperation. It keeps track of which documents have been explored and which sites exclude WebAnts from certain documents. It also coordinates the efforts of the correspond-ants by (a) not granting permission to explore a previously explored document, (b) asking correspond-ants for exclusion information for a given site if needed, and (c) giving documents to correspond-ants whose queues run dry.
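
The account-ant's bookkeeping described above might be sketched in Python as follows. This is an illustrative sketch under stated assumptions, not the prototype's code: the class and method names are hypothetical, exclusions are assumed to match by URL prefix, a site is identified by the host part of the URL, and the pool of documents handed out for "ASSIGNMENT?" is assumed to consist of documents that were unclaimed. The report does not specify any of these points, and return values here are plain Python values rather than protocol responses.

```python
from collections import deque
from typing import Deque, Dict, List, Optional, Set
from urllib.parse import urlsplit


class AccountAnt:
    """Hypothetical in-memory model of the account-ant's state."""

    def __init__(self) -> None:
        self.claimed: Set[str] = set()               # documents already granted
        self.exclusions: Dict[str, List[str]] = {}   # site -> excluded URL prefixes
        self.spare: Deque[str] = deque()             # documents available to hand out

    @staticmethod
    def site_of(url: str) -> str:
        # Assumption: a "site" is the host (and port) part of the URL.
        return urlsplit(url).netloc

    def needs_exclusion_info(self, url: str) -> bool:
        """(b) True if no exclusion information has been recorded for the site."""
        return self.site_of(url) not in self.exclusions

    def exclusion(self, url: str, exclusion_url: Optional[str] = None) -> None:
        """Record an exclusion; with no exclusion_url, just mark the site as known."""
        prefixes = self.exclusions.setdefault(self.site_of(url), [])
        if exclusion_url is not None:
            prefixes.append(exclusion_url)

    def claim(self, url: str) -> bool:
        """(a) Grant permission only for unclaimed, non-excluded documents."""
        if url in self.claimed:
            return False
        prefixes = self.exclusions.get(self.site_of(url), [])
        if any(url.startswith(p) for p in prefixes):  # assumed prefix matching
            return False
        self.claimed.add(url)
        return True

    def unclaim(self, url: str) -> None:
        """Release a document that could not be explored (assumed to be re-offered)."""
        self.claimed.discard(url)
        self.spare.append(url)

    def assignment(self) -> Optional[str]:
        """(c) Hand out a spare document, or None if nothing is available."""
        while self.spare:
            url = self.spare.popleft()
            if self.claim(url):
                return url
        return None
```

In this sketch a correspond-ant would call claim before exploring each queued document and call assignment once its own queue is empty.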

Implementation

Prototype versions of both a correspond-ant and an account-ant have been built in Perl 4. Source code will be available shortly, pending some clean-up (a) to remove diagnostic code and (b) to make it more customizable (i.e., to move some of the settings to command-line arguments).


Performance

Based on evaluation runs of the system on Sun Sparcstations at CMT, this system appears capable of fetching and performing simple information extraction on approximately 3000 documents/hour. This is much faster than a single indexer alone could manage (in my experience), which suggests that the cooperative strategy does produce the desired result in terms of speed.

Note that for each document the system currently saves only the URL and the words with their corresponding frequencies. A richer, more complicated indexing mechanism would require additional time per document, which would reduce this rate.

News

WebAnts is a founding member of Braustübl', a collective of web indexers, established during the Third WWW Conference in Darmstadt, Germany, in April 1995. The purpose of the collective is to provide a simple means by which members will share certain basic information, such as requested additions and deletions.

I have heard again from Darrell Woelk, head of the Infosleuth project at MCC, and he hopes to be stopping by in July for a visit.

WEBster magazine will be interviewing me (John Leavitt) on June 22 regarding my work on WebAnts.


Last updated 21-Jun-95 by John Leavitt (jrrl@cmu.edu)