
Collation and Duplicate Elimination

Detecting duplicate references is difficult without the full contents of each page, because host name aliases, symbolic links, redirections, and other forms of obfuscation can give one page several URLs. To eliminate duplicates, MetaCrawler compares URLs in stages. It first checks whether the domains are the same (e.g. is www.cs.washington.edu really bauhaus.cs.washington.edu?). It then checks whether the paths are the same, using standard aliases (e.g. a URL of the form http://www.some.dom/foo often refers to http://www.some.dom/foo/index.html). Finally, if the domains are the same but the paths are not, it compares the titles of the pages. If the titles match, it assumes that the pages are aliases, although rather than deleting the reference it puts the link beneath the original. MetaCrawler can make a more definite determination if it downloads the pages and compares their full text. Some issues remain unresolved, such as mirrored pages, but our algorithm has worked well enough in practice.
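
The staged comparison can be illustrated with a minimal Python sketch. It is not MetaCrawler's actual code: the alias table HOST_ALIASES, the default document names in DEFAULT_DOCS, and the function names are all assumptions made for illustration, and a static table stands in for however host aliases are really resolved. The full-text comparison is omitted.

```python
from urllib.parse import urlsplit

# Hypothetical alias table (assumption): maps an alias host to its canonical name.
HOST_ALIASES = {"bauhaus.cs.washington.edu": "www.cs.washington.edu"}
# Assumed default document names that a directory URL usually resolves to.
DEFAULT_DOCS = ("index.html", "index.htm")

def canonical_host(host):
    """Lower-case the host and replace a known alias with its canonical name."""
    host = host.lower()
    return HOST_ALIASES.get(host, host)

def canonical_path(path):
    """Drop a trailing default document and trailing slash: /foo/index.html -> /foo."""
    path = path or "/"
    for doc in DEFAULT_DOCS:
        if path.endswith("/" + doc):
            path = path[: -len(doc)]
    return path.rstrip("/") or "/"

def compare(url_a, title_a, url_b, title_b):
    """Return 'duplicate', 'alias', or 'distinct' for two references."""
    a, b = urlsplit(url_a), urlsplit(url_b)
    # Step 1: the domains must be the same once aliases are resolved.
    if canonical_host(a.hostname or "") != canonical_host(b.hostname or ""):
        return "distinct"
    # Step 2: same domain and same path under the standard aliases.
    if canonical_path(a.path) == canonical_path(b.path):
        return "duplicate"
    # Step 3: same domain, different paths; fall back to the page titles.
    if title_a and title_a == title_b:
        return "alias"  # listed beneath the original rather than deleted
    return "distinct"
```

For example, compare("http://www.some.dom/foo", "Foo", "http://www.some.dom/foo/index.html", "Foo") yields "duplicate", while two different paths on the same host with identical titles yield "alias".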

MetaCrawler uses a confidence score to determine how closely a reference matches the query. A higher confidence score indicates a more relevant document. To calculate each reference's confidence score, MetaCrawler first maps the confidence scores returned by each service onto the range 0 to 1000, so that the top pick from each service has a confidence score of 1000. MetaCrawler then eliminates duplicates, adding the confidence scores of the removed references to the score of the reference that is kept. In essence, this allows services to vote for the best reference, as a reference returned by three services will most likely have a higher total than a reference returned by one. In addition, we provide the option of seeing the results sorted by location rather than relevance.
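
The voting scheme can be sketched in a few lines of Python. The text does not say how scores are mapped onto the range 0 to 1000, so this sketch assumes simple linear scaling by each service's top score. It also assumes that duplicate references have already been reduced to the same URL key. The names normalize and collate are hypothetical.

```python
def normalize(scores):
    """Scale one service's raw scores so that its top pick scores 1000.

    Linear scaling is an assumption; the source only gives the target range.
    """
    top = max(scores.values(), default=0)
    if top <= 0:
        return {url: 0.0 for url in scores}
    return {url: 1000.0 * s / top for url, s in scores.items()}

def collate(results_by_service):
    """Sum the normalized scores of duplicates and rank by the total.

    results_by_service maps a service name to {url: raw_score}; the same URL
    key appearing under several services is treated as a duplicate.
    """
    totals = {}
    for scores in results_by_service.values():
        for url, s in normalize(scores).items():
            totals[url] = totals.get(url, 0.0) + s
    # Highest combined confidence first.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
```

With three services whose raw scores are {"u1": 10, "u2": 5}, {"u2": 80, "u3": 40}, and {"u2": 4, "u1": 2}, reference u2 totals 2500, u1 totals 1500, and u3 totals 500, so the reference returned by all three services ranks first.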


Erik W. Selberg 2003-05-25