In addition to term weighting, term selection appears to be as germane to probabilistic models as to vector models. As in Harman's work (Section 4.2), Haines and Croft conduct experiments to ascertain whether any term selection method works better than the others, and whether the number of terms added significantly affects performance.
Haines and Croft evaluate the effect of adding a variable number of terms to queries, from 0 up to 150 for the CACM collection and up to 100 for the WEST collection. For the CACM collection, they find that adding any number of terms improves performance, although the improvement begins to plateau at about 40 terms. For the WEST collection, their result resembles Harman's: after 20 to 30 terms, performance begins to degrade slowly.
Haines and Croft also experiment with a variety of term selection formulas. As in Section 4.2, each of the ranking formulae below returns a ranking value for a term. All terms are then sorted by this value, and the appropriate number are added to the query.
corresponds to
term
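The select-and-add procedure described above (score every candidate term, sort by score, append the top few to the query) can be sketched as follows. This is a minimal illustration, not Haines and Croft's implementation: the function names `expand_query` and `make_tf_idf_scorer` are hypothetical, and the scorer (frequency in the relevant documents multiplied by a log inverse document frequency) is a stand-in chosen for the example rather than one of the exact formulas they tested.

```python
import math
from collections import Counter


def expand_query(query_terms, relevant_docs, score, n_terms):
    """Return the query followed by the n_terms highest-scoring new terms."""
    # Candidates: every term in the relevant documents not already in the query.
    candidates = {t for doc in relevant_docs for t in doc} - set(query_terms)
    # Sort by descending score; ties are broken alphabetically for determinism.
    ranked = sorted(candidates, key=lambda t: (-score(t), t))
    return list(query_terms) + ranked[:n_terms]


def make_tf_idf_scorer(relevant_docs, collection):
    """Hypothetical scorer: (frequency in relevant docs) * log(N / df)."""
    rel_tf = Counter(t for doc in relevant_docs for t in doc)
    df = Counter(t for doc in collection for t in set(doc))
    n_docs = len(collection)

    def score(term):
        if df[term] == 0:  # term never seen in the collection
            return 0.0
        return rel_tf[term] * math.log(n_docs / df[term])

    return score


# Toy data (assumed for illustration): each document is a list of terms.
collection = [
    ["probabilistic", "retrieval", "model"],
    ["vector", "retrieval", "model"],
    ["relevance", "feedback", "retrieval"],
    ["relevance", "feedback", "query", "expansion"],
    ["vector", "space", "model"],
]
relevant = collection[2:4]  # documents judged relevant by the user
scorer = make_tf_idf_scorer(relevant, collection)
print(expand_query(["feedback"], relevant, scorer, n_terms=2))
```

Passing the scoring function as a parameter mirrors the experimental setup: the selection formula can be swapped and `n_terms` varied independently, the two factors the experiments above examine.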
While Haines and Croft find that using a ranking formula improves performance, they do not find that any of the above formulas stands out from the rest. Two hypotheses come to mind:
Fully investigating the first hypothesis is beyond the scope of this article. INQUERY currently uses rtfidf for term selection [4], which suggests that unpublished empirical evidence favors rtfidf over the other formulas. As for the second, I will now examine an approach taken with the traditional probabilistic model.