The work in April and May went into finishing the development of the cooperative indexers. The final design and implementation are discussed here. The source code for these explorers is in the public domain and should be available shortly. A full technical report will also be made available.
Each communication act consists of a single transaction. A transaction consists of a correspond-ant sending a request to the account-ant and the account-ant sending back a response.
The following types of transaction are possible:

"CLAIM url", where url is fully specified. This request asks the question "May I explore this document?" The possible responses are:
    "Okay" meaning "It is okay to explore that document."
    "No" meaning "It is not okay to explore that document (for some reason)."
    "Maybe" meaning "I do not have enough information about that document." Operationally, this currently means that the account-ant does not have any exclusion information for the site and therefore cannot grant permission to explore the document.
    "Wait" meaning "It is not okay to explore that document now, but it might be later" (i.e. ask again). Operationally, this currently means that the account-ant is expecting to receive exclusion information for the site soon from another correspond-ant.

"UNCLAIM url", where url is fully specified. This request says "I could not explore this document." The only possible response is:
    "Okay" meaning "Acknowledged."

"ASSIGNMENT?". This request asks the question "What document may I explore?" A correspond-ant may send such a request if it runs out of documents to explore. The following responses are possible:
    "Assignment url" meaning "You may explore this document."
    "No" meaning "I have no way of providing an assignment (for some reason)."
    "Wait" meaning "I don't have any assignments now, but I might later" (i.e. ask again).

"EXCLUSION url exclusion-url", where url is the fully-specified URL of the document that led to the site in question, and exclusion-url is a fully-specified URL for a part of that site from which WebAnts are excluded. This request tells the account-ant that it should not let correspond-ants explore that part of the site. exclusion-url may be omitted, in which case the request is taken to mean that there are no further restrictions on that site. This case is usually used to specify that the site has no restrictions. The only possible response is:
    "Okay" meaning "Acknowledged."
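The request and response vocabulary above can be sketched as a small in-memory account-ant. This is only an illustration in Python, not the Perl 4 prototype. Several details are assumptions that the report does not state: the class and method names (AccountAnt, handle, site_of) are hypothetical, a site is taken to be the scheme and host of a URL, exclusions are matched as URL prefixes, a "Maybe" reply marks the site as awaiting exclusion information, a second claim on an already claimed document gets "No", and the identity of the requesting correspond-ant is not tracked.

```python
from urllib.parse import urlsplit


def site_of(url):
    """Assumption: a site is the scheme://host[:port] part of a URL."""
    parts = urlsplit(url)
    return f"{parts.scheme}://{parts.netloc}"


class AccountAnt:
    """Toy account-ant that answers one request line with one response line."""

    def __init__(self):
        self.exclusions = {}  # site -> list of excluded URL prefixes
        self.pending = set()  # sites whose exclusion information is expected
        self.claimed = set()  # URLs already granted to a correspond-ant
        self.queue = []       # URLs that could be handed out as assignments

    def handle(self, request):
        words = request.split()
        if not words:
            raise ValueError("empty request")
        verb, args = words[0], words[1:]
        if verb == "CLAIM" and len(args) == 1:
            return self._claim(args[0])
        if verb == "UNCLAIM" and len(args) == 1:
            self.claimed.discard(args[0])
            return "Okay"
        if verb == "ASSIGNMENT?" and not args:
            return self._assignment()
        if verb == "EXCLUSION" and len(args) in (1, 2):
            site = site_of(args[0])
            rules = self.exclusions.setdefault(site, [])
            if len(args) == 2:  # exclusion-url given: record the restriction
                rules.append(args[1])
            self.pending.discard(site)
            return "Okay"
        raise ValueError(f"malformed request: {request!r}")

    def _free(self, url):
        """True if the URL is neither claimed nor under an excluded prefix."""
        rules = self.exclusions[site_of(url)]
        return url not in self.claimed and not any(
            url.startswith(prefix) for prefix in rules
        )

    def _claim(self, url):
        site = site_of(url)
        if site not in self.exclusions:
            if site in self.pending:
                return "Wait"   # exclusion information is expected soon
            self.pending.add(site)
            return "Maybe"      # nothing is known about this site yet
        if not self._free(url):
            return "No"
        self.claimed.add(url)
        return "Okay"

    def _assignment(self):
        for url in self.queue:
            if site_of(url) in self.exclusions and self._free(url):
                self.queue.remove(url)
                self.claimed.add(url)
                return f"Assignment {url}"
        return "Wait"           # this sketch never answers "No" here


if __name__ == "__main__":
    ant = AccountAnt()
    doc = "http://example.org/docs/a.html"
    print(ant.handle(f"CLAIM {doc}"))   # Maybe
    print(ant.handle(f"CLAIM {doc}"))   # Wait
    print(ant.handle(f"EXCLUSION {doc} http://example.org/private/"))  # Okay
    print(ant.handle(f"CLAIM {doc}"))   # Okay
    print(ant.handle("CLAIM http://example.org/private/x.html"))       # No
```

Keeping each transaction to one request line and one response line means the account-ant needs no per-connection state; everything it must remember lives in the four collections.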
Two classes of ants are involved in exploration: an account-ant and a series of correspond-ants. These classes' functions are:
Prototype versions of both a correspond-ant and an account-ant have been built in Perl 4. Source code will be available shortly, pending some clean-up (a) to remove diagnostic code and (b) to make it more customizable (i.e. to move some of the settings to command-line arguments).
Based on evaluation runs of the system on Sun SPARCstations at CMT, this system seems to be capable of fetching and doing simple information extraction on approximately 3000 documents/hour. This is much faster than a single indexer alone could manage (in my experience), which suggests that the cooperative strategy does produce the desired result in terms of speed.
It should be noted that for each document the system currently saves only the URL and the words with their corresponding frequencies. A richer, more complicated indexing mechanism would require additional time per document, which would reduce this rate.
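As a rough illustration of how little is stored per document, the record described above (a URL plus word frequencies) could look like the following Python sketch. The function name index_record, the lowercasing, and the rule that a word is a run of letters or digits are all assumptions; the report does not say how the prototype tokenizes text.

```python
import re
from collections import Counter


def index_record(url, text):
    """Build a minimal index record: the URL and a word -> count mapping.

    Assumption: a word is a maximal run of ASCII letters or digits,
    compared case-insensitively.
    """
    words = re.findall(r"[a-z0-9]+", text.lower())
    return {"url": url, "frequencies": Counter(words)}


if __name__ == "__main__":
    record = index_record("http://example.org/", "The ant and the other ant")
    print(record["frequencies"]["ant"])  # 2
```

Anything beyond this, such as word positions or link structure, would add work per document on top of the fetch itself.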
WebAnts is a founding member of Braustübl', a collective of web indexers, established during the Third WWW Conference in Darmstadt, Germany, in April 1995. The purpose of the collective is to provide a simple means by which members will share certain basic information, such as requested additions and deletions.
I have heard again from Darrell Woelk, head of the Infosleuth project at MCC, and he hopes to be stopping by in July for a visit.
WEBster magazine will be interviewing me (John Leavitt) on June 22 regarding my work on WebAnts.