
Term Selection in a Probabilistic Model

 

Like term weighting, term selection appears to be as germane to probabilistic models as to vector models. As in Harman's work (Section 4.2), Haines and Croft conducted experiments to determine whether any term selection method worked better than the others, and whether the number of terms added affected performance in any significant way.

Haines and Croft evaluate the performance gain from adding a variable number of terms to queries: from 0 up to 150 for the CACM collection and up to 100 for the WEST collection. For the CACM collection, they find that adding any number of terms improves performance, although the improvement begins to plateau at about 40 terms. For the WEST collection, their result is similar to Harman's: beyond 20 to 30 terms, performance begins to degrade slowly.

Haines and Croft also experiment with a variety of term selection formulas. As in Section 4.2, each of the ranking formulas below returns a ranking value for a given term. All terms are then sorted by this value, and the appropriate number are added to the query.
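The sort-and-add step just described can be sketched as follows. This is only an illustration, not Haines and Croft's implementation: the function name `expand_query`, its parameters, and the alphabetical tie-breaking rule are all assumptions, and the scoring function is left pluggable so that any ranking formula can be passed in.

```python
from typing import Callable, Iterable, List


def expand_query(query_terms: List[str],
                 candidate_terms: Iterable[str],
                 score: Callable[[str], float],
                 num_to_add: int) -> List[str]:
    """Append the num_to_add highest-scoring candidate terms to the query.

    Hypothetical helper: `score` stands in for any ranking formula.
    """
    existing = set(query_terms)
    # Consider each candidate once and skip terms already in the query.
    # Sorting alphabetically first makes tie-breaking deterministic (an assumption).
    fresh = sorted({t for t in candidate_terms if t not in existing})
    # Python's sort is stable, so equal scores keep their alphabetical order.
    ranked = sorted(fresh, key=score, reverse=True)
    return list(query_terms) + ranked[:num_to_add]
```

Varying `num_to_add` while holding the scoring function fixed corresponds to the experiment on the number of added terms; varying the scoring function corresponds to the comparison of ranking formulas.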

EMIM
The expected mutual information [52], defined by:
\[ \mathrm{EMIM} = \sum_{i,j} P(i, j) \log \frac{P(i, j)}{P(i)\,P(j)} \]
where P(i, j) is the probability of both i and j occurring in a document judged relevant, i corresponds to the term appearing or not appearing in a given document, and j corresponds to a given document being relevant or irrelevant. EMIM was originally derived to take into account term dependence.
PMIM
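This is the single term of EMIM for which the term is present and the given document is relevant, i.e.
\[ \mathrm{PMIM} = P(i, j) \log \frac{P(i, j)}{P(i)\,P(j)}, \quad i = \text{term present},\; j = \text{document relevant} \]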
P_4
The probability that the document will be relevant (or irrelevant) depending on the occurrence (or absence) of the term:
eqnarray447
idf
The Inverse Document Frequency, defined as
eqnarray454
rdfidf
The product of the term's relevant document frequency and the inverse document frequency, defined as
\[ \mathit{rdfidf} = \mathit{rdf} \times \mathit{idf} \]
rtf
as defined by Equation (12).
rtfidf
as defined by Equation (13).
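
To make the two mutual-information measures above concrete, here is a minimal sketch of how EMIM and PMIM might be computed for one term from document counts. It assumes the standard mutual-information form, a sum of P(i, j) log(P(i, j) / (P(i) P(j))) over the four combinations of term presence and document relevance, with probabilities estimated as simple relative frequencies and 0 log 0 taken as 0. The function names and the count parameters `r`, `n`, `R`, `N` are illustrative assumptions, not taken from Haines and Croft.

```python
import math


def _mi_term(p_ij: float, p_i: float, p_j: float) -> float:
    """One summand P(i,j) * log(P(i,j) / (P(i) P(j))), with 0 log 0 = 0."""
    if p_ij == 0.0:
        return 0.0
    return p_ij * math.log(p_ij / (p_i * p_j))


def emim(r: int, n: int, R: int, N: int) -> float:
    """Mutual information between term presence and relevance.

    Assumed counts: r = relevant documents containing the term,
    n = documents containing the term, R = relevant documents,
    N = all documents considered.
    """
    # Joint relative frequencies for (term present?, document relevant?).
    joint = {
        (True, True): r / N,
        (True, False): (n - r) / N,
        (False, True): (R - r) / N,
        (False, False): (N - n - R + r) / N,
    }
    p_term = {True: n / N, False: (N - n) / N}
    p_rel = {True: R / N, False: (N - R) / N}
    return sum(_mi_term(p, p_term[i], p_rel[j]) for (i, j), p in joint.items())


def pmim(r: int, n: int, R: int, N: int) -> float:
    """The single summand for 'term present, document relevant'."""
    return _mi_term(r / N, n / N, R / N)
```

Under these assumptions, a term whose occurrence is statistically independent of relevance scores approximately zero on both measures, while a term concentrated in relevant documents scores positively.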

While Haines and Croft find that using a ranking formula improves performance, they do not find that any of the above formulas stands out from the rest. Two hypotheses come to mind:

  1. The structure of INQUERY is robust enough to give similar results regardless of the term selection methods used;
  2. What is important is selecting only a limited number of terms, and any reasonable ranking formula will do.

Fully investigating the first hypothesis is beyond the scope of this article. Currently, INQUERY uses rtfidf for term selection [4], which would seem to indicate that there is unpublished empirical evidence that rtfidf does in fact perform better than the others. As for the second, I will now examine an approach taken with the traditional probabilistic model.



Erik Selberg
Wed Aug 6 12:24:17 PDT 1997