next up previous
Next: Two Probabilistic Systems Up: Two Vector-Based Systems Previous: Term Weighting in a

Term Selection in a Vector Model

  One of the advantages of Rocchio's technique is that the user only has to judge whether an entire document is relevant or not, and the system will infer which terms or attributes corresponded to that relevance judgment. His particular implementation expands the query to include all potentially relevant terms, and then focuses on calculating the appropriate weight for each term. Unfortunately, one of the side effects of his system is that reformulated queries tend to get large and include words that are arguably irrelevant, typically function words with no inherent meaning. The result of this is that while Relevance Feedback is able to increase recall, precision suffers as irrelevant documents are obtained.

As part of the IRX [17, ] project, Donna Harman conducted a study to determine if a small set of highly relevant words could be found and if that set of words would produce better results than queries which incorporated all possibly relevant terms. Harman does this by looking at three different techniques for obtaining relevant terms: Relevance Feedback, term variation, and nearest neighbor routines. Of note in her work is that the IRX system does not use stemming, but indexes exact terms.

To determine if it is feasible to obtain a small set of relevant terms and generate better results than massive term expansion, Harman ran an experiment using the Cranfield 1400 test collection, which contains 1400 abstracts related to aerospace, and 225 related queries with relevance judgments. In conducting this experiment, no term weighting is done, and thus all terms have equal sway in determining relevance.

For her experiment, Harman defines the following:

tex2html_wrap_inline2203
The frequency of term tex2html_wrap_inline2205 in document tex2html_wrap_inline2159.
tex2html_wrap_inline2425
  The frequency of term tex2html_wrap_inline2205 in D, e.g. tex2html_wrap_inline2431
tex2html_wrap_inline2433
The noise of term tex2html_wrap_inline2205, defined by tex2html_wrap2415
tex2html_wrap_inline2437
The maximal value of tex2html_wrap_inline2433 for all terms tex2html_wrap_inline2205
r(Q, d)
For this experiment, Harman sets r to be:
 eqnarray300
R
The relevant documents obtained after a query has been submitted.
postings
Relevant documents that contain a given term, e.g. tex2html_wrap_inline2449.

Harman also uses the above definitions of term frequency, document frequency, and noise restricted to R and tex2html_wrap_inline2449:

tex2html_wrap_inline2455
The frequency of term tex2html_wrap_inline2205 in document tex2html_wrap_inline2459.
tex2html_wrap_inline2461
the frequency of term tex2html_wrap_inline2205 in R, e.g. tex2html_wrap_inline2467. Referred to as frequency within postings.
tex2html_wrap_inline2469
Noise within postings, as defined by tex2html_wrap2416

The primary goal of Harman's experiment is to determine if a small set of relevant terms could be obtained to create better queries. Harman hypothesizes that the following selection formulas will satisfy that goal. Each formula below returns a numeric value for term tex2html_wrap_inline2205. All terms are sorted by this ranking from largest value to smallest, and then the appropriate number are added to the query.

noise
The noise of term tex2html_wrap_inline2205. The sorting semantics of noise are from lowest to highest, thus the inverse is used.
 eqnarray325
number of postings
The raw number of relevant documents containing the term tex2html_wrap_inline2205.
eqnarray330
noise within postings
The noise of term tex2html_wrap_inline2205 restricted to relevant documents containing term tex2html_wrap_inline2205.
 eqnarray333
noise * tf within postings
The noise of term tex2html_wrap_inline2205 multiplied by the term frequency of tex2html_wrap_inline2205 within relevant documents containing tex2html_wrap_inline2205.
eqnarray338
noise * tf * postings
The noise of term tex2html_wrap_inline2205 multiplied by the term frequency of tex2html_wrap_inline2205 within all documents multiplied by the number of relevant documents containing tex2html_wrap_inline2205.
eqnarray342
noise * tf
The noise of term tex2html_wrap_inline2205 multiplied by the term frequency of tex2html_wrap_inline2205 within all documents.
 eqnarray347

Upon the submission of a query, the experiment runs as follows:

  1. Obtain the set of top ten documents D' using a set ranking formula r;
  2. Simulate user interaction by marking all relevant documents, as determined by the Cranfield relevance judgments. Let R be the set of documents within D' that are relevant.
  3. Create a list l of all non-common terms from relevant documents in R;
  4. For each selection function:
       
    1. Sort list l using the given selection function;
    2. Add 20 terms from the top of the sorted list to the query;
    3. Re-submit the new query;
    4. Evaluate.

The results from this experiment are shown in

 table358
Table 1:  Results from Harman Feedback Experiment

Table 1. Of note in the table are the two rows labeled ``% improvement'' which measure the performance of each selection formula. They indicate that there is a significant difference in performance, depending on which sorting formula is used, with the best sorting formula being noise * tf * postings, at least for the Cranfield 1400.

In addition to sorting, the user can be consulted as an additional filter to determine true relevance. Harman describes an experiment with simulated user selection whereby a 31% improvement in performance can be obtained, with an average of 12 terms being added to the query instead of 20. She does admit that this is the result of a ``perfect choice'' by the user and unlikely in the real world setting, although a 1996 user study by Koenemann and Belkin [25] demonstrates that user filtration of the query terms increases performance about 30%.

Harman also describes a secondary experiment in varying the number of terms added to the query from 10 through 40 (in increments of 10). Her findings indicate that performance increases when 10 terms are added, and again when 20 are added. However, for 30 and 40 terms, she observes that the performance using the best sorting methods decreased, and while the performance of the worse methods still increased, it was never superior than the better sorts. This phenomenon appears to be a classic case of overfitting.

Harman carries out two further experiments to again find 20 terms to add to the query. The specifics of these experiments are not directly germane to this paper, and thus I will just summarize the results. The first is to find 20 term variants using stemmed words. The second is to find 20 best nearest neighbors, where the nearest neighbors of a term are the terms adjacent to it in the document. Harman finds that adding arbitrary stemmed words to the query decreases performance, and performance increases only when terms were filtered via the simulated user selection described above. Similarly, adding arbitrary nearest neighbor terms decreases performance, but performance increases by adding terms filtered with simulated user selection.

The results of Harman's experiments shows that one can achieve a similar level of performance by expanding a query by 20 feedback terms as one can by expanding the query using all feedback terms. Furthermore, she shows that by adding more than 20 terms, performance degrades, at least for the Cranfield collection. She also shows that performance can be further enhanced by giving user direct control over the specific terms to be added to a query. The tradeoff for this higher performance comes in the form of increased user interaction, as the user interface requires users to make judgment calls on both documents and subsequently terms. However, taken together, the results present a strong case for systems that attempt to select the best terms for feedback and filter those through a user.


next up previous
Next: Two Probabilistic Systems Up: Two Vector-Based Systems Previous: Term Weighting in a

Erik Selberg
Wed Aug 6 12:24:17 PDT 1997