The work done in April and May was to finish the development of the cooperative indexers. The final design and implementation will be discussed here. The source code for these explorers is in the public domain and should be available shortly. A full technical report will also be made available.
Each communication act consists of a single transaction. A transaction consists of a correspond-ant sending a request to the account-ant and the account-ant sending back a response. The following types of transaction are possible:
"CLAIM url", where url is fully specified. This request asks the question "May I explore this document?" The possible responses are:
"Okay" meaning "It is okay to explore that document."
"No" meaning "It is not okay to explore that document (for some reason)."
"Maybe" meaning "I do not have enough information about that document." Operationally, this currently means that the account-ant does not have any exclusion information for the site and therefore cannot grant permission to explore the document.
"Wait" meaning "It is not okay to explore that document now, but it might be later" (i.e. ask again). Operationally, this currently means that the account-ant is expecting to receive exclusion information for the site soon from another correspond-ant.
"UNCLAIM url", where url is fully specified. This request says "I could not explore this document." The only possible response is:
"Okay" meaning "Acknowledged."
"ASSIGNMENT?". This request asks the question "What document may I explore?" A correspond-ant may send such a request if it runs out of documents to explore. The following responses are possible:
"Assignment url" meaning "You may explore this document."
"No" meaning "I have no way of providing an assignment (for some reason)."
"Wait" meaning "I don't have any assignments now, but I might later" (i.e. ask again).
"EXCLUSION url exclusion-url", where url is the fully specified URL of the document that led to the site in question, and exclusion-url is a fully specified URL for a part of that site from which WebAnts are excluded. This request tells the account-ant that it should not let correspond-ants explore that part of the site. exclusion-url may be omitted, in which case the request is taken to mean that there are no further restrictions on that site; this form is usually used to specify that the site has no restrictions at all. The only possible response is:
"Okay" meaning "Acknowledged."
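The transactions above can be pictured with a small in-memory model of the account-ant. This is only a sketch in Python, not the Perl 4 prototype: the class name AccountAnt, its attributes, and the prefix test used for exclusions are all hypothetical. It also assumes that, after answering "Maybe" for a site, the account-ant expects that correspond-ant to report exclusion information, so later claims on the same site get "Wait". The "No" response to "ASSIGNMENT?" is left out because the conditions for it are not specified here.

```python
from urllib.parse import urlsplit


class AccountAnt:
    """Toy account-ant. Every name in this class is illustrative only."""

    def __init__(self):
        self.exclusions = {}   # site -> list of excluded URL prefixes
        self.expected = set()  # sites whose exclusion info should arrive soon
        self.claimed = set()   # URLs already granted to some correspond-ant
        self.pending = []      # URLs that may be handed out as assignments

    @staticmethod
    def site_of(url):
        # Assumption: a "site" is identified by the host part of the URL.
        return urlsplit(url).netloc.lower()

    def handle(self, request):
        """Map one request line to one response line."""
        verb, *args = request.split()
        if verb == "CLAIM" and len(args) == 1:
            return self._claim(args[0])
        if verb == "UNCLAIM" and len(args) == 1:
            self.claimed.discard(args[0])
            return "Okay"
        if verb == "ASSIGNMENT?" and not args:
            return self._assignment()
        if verb == "EXCLUSION" and len(args) in (1, 2):
            site = self.site_of(args[0])
            rules = self.exclusions.setdefault(site, [])
            if len(args) == 2:          # omitted exclusion-url: no new rule
                rules.append(args[1])
            self.expected.discard(site)
            return "Okay"
        # Assumption: malformed requests are simply rejected.
        raise ValueError("unrecognized request: " + request)

    def _claim(self, url):
        site = self.site_of(url)
        if site not in self.exclusions:
            if site in self.expected:
                return "Wait"           # someone else should report soon
            self.expected.add(site)     # assumed: the asker will report
            return "Maybe"
        if url in self.claimed:
            return "No"
        if any(url.startswith(prefix) for prefix in self.exclusions[site]):
            return "No"
        self.claimed.add(url)
        return "Okay"

    def _assignment(self):
        # Assumption: pending URLs were already checked against exclusions.
        while self.pending:
            url = self.pending.pop(0)
            if url not in self.claimed:
                self.claimed.add(url)
                return "Assignment " + url
        return "Wait"


ant = AccountAnt()
print(ant.handle("CLAIM http://example.org/a.html"))      # Maybe
print(ant.handle("EXCLUSION http://example.org/a.html"))  # Okay
print(ant.handle("CLAIM http://example.org/a.html"))      # Okay
```

In this sketch a correspond-ant that receives "Maybe" would be expected to look up the site's exclusion information itself and report it with one or more "EXCLUSION" requests before claiming the document again.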
Two classes of ants are involved in exploration, an account-ant and a series of correspond-ants. These classes' functions are:
Prototype versions of both a correspond-ant and an account-ant have been built in Perl 4. Source code will be available shortly, pending some clean-up (a) to remove diagnostic code and (b) to make it more customizable (i.e. move some of the settings to command-line arguments).
Based on evaluation runs of the system on Sun Sparcstations at CMT, this system seems to be capable of fetching and doing simple information extraction on approximately 3000 documents/hour. This is much faster than a single indexer alone could manage (in my experience), which suggests that the cooperative strategy does produce the desired result in terms of speed.
It should be noted that for each document the system currently saves only the URL and the words with their corresponding frequencies; obviously, a richer, more complicated indexing mechanism would require additional time per document, which would reduce this rate.
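As a rough illustration of how little is stored per document, the following Python sketch builds a record holding only the URL and word frequencies. The function name index_document, the record layout, and the tokenization rule (lowercased runs of letters and digits) are assumptions made for this example and are not taken from the prototype.

```python
import re
from collections import Counter


def index_document(url, text):
    """Return a minimal record: the URL plus word -> frequency counts."""
    # Assumed tokenization: lowercase, split on anything that is not a-z or 0-9.
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {"url": url, "frequencies": Counter(words)}


record = index_document("http://example.org/", "The ant and the other ant.")
print(record["frequencies"]["ant"])  # 2
```

Anything richer than this, such as recording word positions or document structure, would add work per document.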
WebAnts is a founding member of Braustübl', a collective of web indexers, established during the Third WWW Conference in Darmstadt, Germany, in April 1995. The purpose of the collective is to provide a simple means by which members will share certain basic information, such as requested additions and deletions.
I have heard again from Darrell Woelk, head of the Infosleuth project at MCC, and he hopes to be stopping by in July for a visit.
WEBster magazine will be interviewing me (John Leavitt) on June 22 regarding my work on WebAnts.