next up previous
Next: Term Selection in a Up: Two Vector-Based Systems Previous: Two Vector-Based Systems

Term Weighting in a Vector Model

Rocchio's technique of feedback is based on the approximation of an optimal query. In order to describe his technique, it is first necessary to recite his definition of an optimal query.

Given a document collection D and a query Q, Rocchio assumes the existence of tex2html_wrap_inline2357, which is the relevant set of documents in D to Q. He further defines an ideal request as a query that ranks all elements of tex2html_wrap_inline2321 above all other elements in D. However, he is quick to point out that the ideal request is unlikely to exist for all queries. Thus, he defines an optimal request query tex2html_wrap_inline2367, which is the query that maximizes the similarity of the relevant documents and minimizes the similarity of the irrelevant documents. In mathematical terms, Rocchio defines C as
eqnarray176
tex2html_wrap_inline2367 is the vector that corresponds to Q when C is maximal. To find this, Rocchio uses the cosine similarity function defined by Equation (2) to determine the correlation between a query and a given document. Substituting for r(Q, d) he derives
 eqnarray186
Rocchio observes that C is maximized when Q = kA for any arbitrary scalar k. Thus,
eqnarray217

While the formula for tex2html_wrap_inline2367 shows that there exists an optimal query for a given set of relevant documents tex2html_wrap_inline2321, this formula does not help retrieve those documents, as knowledge of tex2html_wrap_inline2321 obviates the need for retrieval and making tex2html_wrap_inline2367 in the first place.

Rocchio then defines his Relevance Feedback technique as an iterative process whereby the user begins with query tex2html_wrap_inline2123 that returns a set of documents tex2html_wrap_inline2127, and continues to reformulate the query tex2html_wrap_inline2397 until the user is satisfied that tex2html_wrap_inline2349 is tex2html_wrap_inline2321. The question that now arises is what is the process by which tex2html_wrap_inline2403 becomes tex2html_wrap_inline2405?

Rocchio describes the abstract mathematical model as
eqnarray233
where R is the set of documents the user has judged relevant, and S the set of irrelevant documents. Rocchio suggests that f should be
eqnarray236
or alternatively
 eqnarray253
where Equation (8) is the form commonly seen in subsequent literature.

Rocchio made the initial evaluation of his method using 17 requests on the SMART system. He made some modifications to the technique; in particular, since SMART was unable to handle negative weights, any negative weight was set to 0. His results were very promising, showing approximately a 10% increase in precision for every level of recall when using Relevance Feedback, although due to the extremely limited sample size the results could not be taken as statisticly significant.

Ide conducted some experimental work on Rocchio's formula, as well as some modifications, again using the SMART system [22]. He verified that Rocchio's system did in fact result in an increase in performance. He also suggested a modification of Rocchio's formula which is often seen, commonly called Ide Dec-Hi:
eqnarray278
where tex2html_wrap_inline2413 is the first non-relevant document retrieved. Ide Dec-Hi is commonly thought to be the best pure vector-based formula available [43].

The result of Rocchio's work is an elegant and simple way of incorporating relevance information to refine queries. The user interface necessary requires only that a user mark documents as either relevant or irrelevant, and the system infers exactly what components of those documents make it relevant or irrelevant. It is also computationally feasible, being linear in the number of documents retrieved and terms per document. All together, this technique combines a strong mathematical foundation with a useful practical method.


next up previous
Next: Term Selection in a Up: Two Vector-Based Systems Previous: Two Vector-Based Systems

Erik Selberg
Wed Aug 6 12:24:17 PDT 1997