When comparing techniques, there needs to be some objective method of evaluation in order to determine which is better under some set of conditions. IR systems are typically evaluated using precision and recall. For any set of documents retrieved, precision is the percentage of those documents that are relevant, and recall is the percentage of relevant documents retrieved out of all relevant documents. In this paper, I will measure performance as the average increase in precision at a fixed recall. Typically, an improvement in recall comes at the expense of a degradation in precision, and vice versa. Relevance Feedback can be thought of as a method of improving performance by increasing precision, or slowing its degradation, when recall is increased.
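To make these definitions concrete, the following is a minimal Python sketch of the two measures and of precision at a fixed recall. The function names, document identifiers, and example rankings are illustrative assumptions and do not come from any particular system. In particular, "precision at a fixed recall" is interpreted here as the precision at the first rank where the target recall is reached, which is only one of several conventions in use.

```python
from typing import Hashable, Sequence, Set, Tuple


def precision_recall(retrieved: Sequence[Hashable],
                     relevant: Set[Hashable]) -> Tuple[float, float]:
    """Return (precision, recall) for one set of retrieved documents.

    Assumes `retrieved` contains no duplicate documents.
    """
    hits = sum(1 for doc in retrieved if doc in relevant)
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall


def precision_at_recall(ranking: Sequence[Hashable],
                        relevant: Set[Hashable],
                        target_recall: float) -> float:
    """Precision at the first rank where recall reaches `target_recall`.

    Returns 0.0 if the ranking never reaches that recall level
    (an assumed convention; other choices are possible).
    """
    hits = 0
    for rank, doc in enumerate(ranking, start=1):
        if doc in relevant:
            hits += 1
            if hits / len(relevant) >= target_recall:
                return hits / rank
    return 0.0


# Hypothetical rankings for one query, before and after feedback.
relevant_docs = {"d1", "d2"}
baseline = ["d3", "d1", "d7", "d2", "d5"]
with_feedback = ["d1", "d2", "d3", "d7", "d5"]

# Increase in precision at a fixed recall of 1.0 for this single query.
gain = (precision_at_recall(with_feedback, relevant_docs, 1.0)
        - precision_at_recall(baseline, relevant_docs, 1.0))
```

Averaging such per-query gains over a set of queries would give one reading of the "average increase in precision at a fixed recall" used in this paper. The sketch also shows the tradeoff noted above: in the baseline ranking, precision is 0.5 whether recall is 0.5 (top two documents) or 1.0 (top four), whereas the reordered ranking reaches full recall at precision 1.0.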
Unfortunately, while precision and recall afford a reasonable method of comparing two systems, in practice published results based on them are generally not comparable, most often because of modifications made to standard test collections or differences in how the documents used in the calculations are chosen. For systems using Relevance Feedback, results are also difficult to compare because there are various methods of ascertaining the improvement caused by feedback [7, ]. For these reasons, I will be conservative in stating that one system is generally better than another, as well as in predicting the potential gain from a system that combines techniques that have not actually been combined.
A final note regarding comparison, which comes into play when discussing deployment of an IR system, concerns the evaluation metric. In IR, the evaluation question typically asked is ``How well does the system satisfy the query?'' However, other metrics may also be important. For instance: