One of the advantages of Rocchio's technique is that the user only has to judge whether an entire document is relevant or not, and the system will infer which terms or attributes corresponded to that relevance judgment. His particular implementation expands the query to include all potentially relevant terms, and then focuses on calculating the appropriate weight for each term. Unfortunately, one of the side effects of his system is that reformulated queries tend to get large and include words that are arguably irrelevant, typically function words with no inherent meaning. The result of this is that while Relevance Feedback is able to increase recall, precision suffers as irrelevant documents are obtained.
As part of the IRX [17, ] project, Donna Harman conducted a study to determine if a small set of highly relevant words could be found and if that set of words would produce better results than queries which incorporated all possibly relevant terms. Harman does this by looking at three different techniques for obtaining relevant terms: Relevance Feedback, term variation, and nearest neighbor routines. Of note in her work is that the IRX system does not use stemming, but indexes exact terms.
To determine if it is feasible to obtain a small set of relevant terms and generate better results than massive term expansion, Harman ran an experiment using the Cranfield 1400 test collection, which contains 1400 abstracts related to aerospace, and 225 related queries with relevance judgments. In conducting this experiment, no term weighting is done, and thus all terms have equal sway in determining relevance.
For her experiment, Harman defines the following:
Harman also uses the above definitions of term frequency, document
frequency, and noise restricted to R and
:
The primary goal of Harman's experiment is to determine if a small
set of relevant terms could be obtained to create better
queries. Harman hypothesizes that the following selection formulas
will satisfy that goal. Each formula below returns a
numeric value for term
. All terms are sorted by this ranking from largest value to
smallest, and then the appropriate number are added to the query.
Upon the submission of a query, the experiment runs as follows:
The results from this experiment are shown in
![]()
Table 1: Results from Harman Feedback
Experiment
Table 1. Of note in the table are the two rows labeled ``% improvement'' which measure the performance of each selection formula. They indicate that there is a significant difference in performance, depending on which sorting formula is used, with the best sorting formula being noise * tf * postings, at least for the Cranfield 1400.
In addition to sorting, the user can be consulted as an additional filter to determine true relevance. Harman describes an experiment with simulated user selection whereby a 31% improvement in performance can be obtained, with an average of 12 terms being added to the query instead of 20. She does admit that this is the result of a ``perfect choice'' by the user and unlikely in the real world setting, although a 1996 user study by Koenemann and Belkin [25] demonstrates that user filtration of the query terms increases performance about 30%.
Harman also describes a secondary experiment in varying the number of terms added to the query from 10 through 40 (in increments of 10). Her findings indicate that performance increases when 10 terms are added, and again when 20 are added. However, for 30 and 40 terms, she observes that the performance using the best sorting methods decreased, and while the performance of the worse methods still increased, it was never superior than the better sorts. This phenomenon appears to be a classic case of overfitting.
Harman carries out two further experiments to again find 20 terms to add to the query. The specifics of these experiments are not directly germane to this paper, and thus I will just summarize the results. The first is to find 20 term variants using stemmed words. The second is to find 20 best nearest neighbors, where the nearest neighbors of a term are the terms adjacent to it in the document. Harman finds that adding arbitrary stemmed words to the query decreases performance, and performance increases only when terms were filtered via the simulated user selection described above. Similarly, adding arbitrary nearest neighbor terms decreases performance, but performance increases by adding terms filtered with simulated user selection.
The results of Harman's experiments shows that one can achieve a similar level of performance by expanding a query by 20 feedback terms as one can by expanding the query using all feedback terms. Furthermore, she shows that by adding more than 20 terms, performance degrades, at least for the Cranfield collection. She also shows that performance can be further enhanced by giving user direct control over the specific terms to be added to a query. The tradeoff for this higher performance comes in the form of increased user interaction, as the user interface requires users to make judgment calls on both documents and subsequently terms. However, taken together, the results present a strong case for systems that attempt to select the best terms for feedback and filter those through a user.