Next: Collation and Duplicate Elimination Up: The MetaCrawler Softbot Previous: Understanding Query and Output

Post-Processing References

Supporting a rich feature set can be problematic. Some services have all the features that MetaCrawler uses, whereas others lack particular features such as phrase searching or weighting words differently. MetaCrawler implements these features, and others not available from any service, by downloading the pages referred to by each service and analyzing them directly. The time required to download all pages in parallel is reasonable, usually adding no more than a couple of minutes to the search time. In addition, MetaCrawler is able to process results that have already arrived while waiting for others. Therefore, minimal computation time is required after all the references have been retrieved. It is important to note that while this does improve quality substantially, it is a user-enabled option, as most users prefer to have the results displayed quickly.
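
The overlap between downloading and analysis described above can be sketched as follows. This is only an illustration, not MetaCrawler's actual implementation: the function name `process_as_they_arrive` and the caller-supplied `fetch` and `analyze` callables are hypothetical, and a thread pool stands in for whatever parallel download mechanism the system really uses.

```python
from concurrent.futures import ThreadPoolExecutor, as_completed


def process_as_they_arrive(urls, fetch, analyze, max_workers=8):
    """Download all references in parallel and analyze each one on arrival.

    `fetch(url)` returns the page text or raises on failure; `analyze(text)`
    returns whatever per-page result the caller needs. Both are hypothetical
    stand-ins supplied by the caller.
    """
    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Start every download at once.
        future_to_url = {pool.submit(fetch, url): url for url in urls}
        # as_completed yields each download as soon as it finishes, so pages
        # that arrive early are analyzed while slower ones are still pending.
        for future in as_completed(future_to_url):
            url = future_to_url[future]
            try:
                page = future.result()
            except Exception:
                continue  # the reference could not be retrieved; skip it
            results[url] = analyze(page)
    return results
```

Because the analysis of early pages happens while slower downloads are still in flight, little work remains once the last reference arrives.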

Downloading and analyzing references locally has proven to be a powerful tool. It lets us implement features not found in some services. For example, Lycos does not currently handle phrase searching. MetaCrawler simulates phrase searching by sending Lycos a query for documents containing all the query words, downloading the resulting pages, and keeping only those pages that actually contain the phrase. Using similar methods, we are able to handle other features commonly found in some, but not all, search services, such as requiring words to be either present or absent in pages.
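
A minimal sketch of this kind of local filtering is given below. It assumes the pages have already been downloaded into a mapping from URL to text. The names `tokenize`, `contains_phrase`, and `filter_pages`, and the simple alphanumeric tokenization, are illustrative assumptions rather than a description of MetaCrawler's code.

```python
import re


def tokenize(text):
    """Split text into lowercase alphanumeric words (a simplifying assumption)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_phrase(text, phrase):
    """Return True if the words of `phrase` occur consecutively in `text`."""
    words = tokenize(text)
    target = tokenize(phrase)
    n = len(target)
    if n == 0:
        return True  # an empty phrase places no constraint on the page
    return any(words[i:i + n] == target for i in range(len(words) - n + 1))


def filter_pages(pages, phrase="", required=(), forbidden=()):
    """Keep the URLs whose text satisfies the phrase and word constraints.

    `pages` maps URL -> downloaded page text.
    """
    kept = []
    for url, text in pages.items():
        vocabulary = set(tokenize(text))
        if not contains_phrase(text, phrase):
            continue
        if any(word.lower() not in vocabulary for word in required):
            continue
        if any(word.lower() in vocabulary for word in forbidden):
            continue
        kept.append(url)
    return kept
```

For a service without phrase search, the query would be sent as a plain all-words query and the returned pages passed through `filter_pages` with the original phrase.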

In addition to simulating features that a particular search service lacks, we are able to implement new features. The most obvious is that by downloading a reference in real time, we verify that it exists. This has proven to be a popular feature in itself. We are also able to analyze the pages to support sophisticated ranking and clustering algorithms, and to extract words from the documents that may be useful for refining the query.
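
One simple way to extract candidate refinement words is sketched below. The text does not specify how MetaCrawler selects such words, so this is an assumed approach: count how many downloaded pages each word appears in, ignoring the query's own words and a small stop list. The name `suggest_refinements` and the contents of `STOP_WORDS` are hypothetical.

```python
import re
from collections import Counter

# A tiny illustrative stop list; a real system would use a larger one.
STOP_WORDS = frozenset(
    {"a", "an", "and", "the", "of", "to", "in", "is", "for", "on", "with"}
)


def suggest_refinements(pages, query, top_n=5):
    """Return up to `top_n` words that occur in the most downloaded pages.

    `pages` is an iterable of page texts. Words already in the query and
    stop words are excluded. Ties are broken alphabetically.
    """
    query_words = set(re.findall(r"[a-z0-9]+", query.lower()))
    document_frequency = Counter()
    for text in pages:
        # Use a set so each page counts a word at most once.
        words = set(re.findall(r"[a-z0-9]+", text.lower()))
        document_frequency.update(words - query_words - STOP_WORDS)
    ranked = sorted(document_frequency.items(), key=lambda kv: (-kv[1], kv[0]))
    return [word for word, _ in ranked[:top_n]]
```

Only pages that were actually retrieved contribute to the counts, so the suggestions reflect references already verified to exist.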


Erik W. Selberg 2003-05-25