
Robertson / Sparck Jones Formalization

Robertson and Sparck Jones describe a probabilistic system as one that determines the weights of terms in document and query vectors probabilistically [24]. Using log likelihoods, they derive the following expression for ranking a document by a sum of probabilistically derived weights:
$$ \mathrm{sim}(D, Q) = \sum_{i=1}^{m} x_i \log \frac{p_i (1 - q_i)}{q_i (1 - p_i)} \qquad (4) $$
where $p_i$ is the probability that term $t_i$ is in a relevant document, $q_i$ is the probability that $t_i$ is in a non-relevant document, $x_i$ is 1 if $t_i$ occurs in the document and 0 otherwise, and m is the number of terms in the collection. The remaining question is how to calculate $p_i$ and $q_i$.
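
As a rough illustration of the ranking expression, the following Python sketch sums the log-odds weight of every term that occurs in a document. The function name `rsj_score` and the toy probabilities are assumptions made for this example, and the sketch assumes the sum runs only over terms present in the document and that every probability lies strictly between 0 and 1.

```python
import math

def rsj_score(doc_terms, p, q):
    """Sum the log-odds weights of the terms that occur in a document.

    doc_terms: set of terms in the document (hypothetical representation)
    p[t]: probability that term t is in a relevant document
    q[t]: probability that term t is in a non-relevant document
    Both probabilities must lie strictly between 0 and 1.
    """
    score = 0.0
    for t in doc_terms:
        if t in p and t in q:
            score += math.log(p[t] * (1 - q[t]) / (q[t] * (1 - p[t])))
    return score

# Toy values: "retrieval" discriminates relevant documents, "the" does not.
p = {"retrieval": 0.8, "the": 0.5}
q = {"retrieval": 0.2, "the": 0.5}
score = rsj_score({"retrieval", "the"}, p, q)
print(round(score, 3))
```

A term that is equally likely in relevant and non-relevant documents contributes a weight of zero, so only the discriminating term affects the score here.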

For a given term $t_i$ and a query Q, Robertson and Sparck Jones define:

N
The number of documents in the collection, i.e. |D|
R
The number of relevant documents, i.e. $|\{d \in D : d \text{ is relevant to } Q\}|$
$n_i$
The number of documents containing term $t_i$, i.e. $|\{d \in D : t_i \in d\}|$
$r_i$
The number of relevant documents containing $t_i$, i.e. $|\{d \in D : d \text{ is relevant to } Q \text{ and } t_i \in d\}|$
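and create the following contingency table, which counts the relevant and non-relevant documents in which a given term is present or absent:

                 Relevant        Non-relevant              Total
Term present     $r_i$           $n_i - r_i$               $n_i$
Term absent      $R - r_i$       $N - n_i - R + r_i$       $N - n_i$
Total            $R$             $N - R$                   $N$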
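
The four cells of the contingency table follow directly from the four counts defined above. The Python sketch below builds them for a toy collection; the function name `contingency_table` and the example counts are assumptions for illustration only.

```python
def contingency_table(N, R, n_i, r_i):
    """Return the 2x2 cell counts keyed by (term state, relevance).

    N: documents in the collection, R: relevant documents,
    n_i: documents containing the term, r_i: relevant documents containing it.
    """
    return {
        ("present", "relevant"): r_i,
        ("present", "non-relevant"): n_i - r_i,
        ("absent", "relevant"): R - r_i,
        ("absent", "non-relevant"): N - n_i - R + r_i,
    }

# Toy counts: 1000 documents, 10 relevant, term in 50 documents, 5 of them relevant.
table = contingency_table(N=1000, R=10, n_i=50, r_i=5)
for cell, count in table.items():
    print(cell, count)
```

The four cells always sum to N, which is a quick sanity check on any set of counts.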

Using this table, $p_i$ and $q_i$, the probabilities that a term is present in a relevant document and in a non-relevant document respectively, are defined by Robertson and Sparck Jones as
$$ p_i = \frac{r_i}{R}, \qquad q_i = \frac{n_i - r_i}{N - R} $$
and by substituting back into Equation (4) they arrive at
$$ w_i = \log \frac{r_i / (R - r_i)}{(n_i - r_i) / (N - n_i - R + r_i)} $$
also referred to as the f4 formula. To avoid undefined weights when a count such as $n_i$ is 0, a common case in operational systems, 0.5 is added to each of the four counts in the formula, resulting in the f4 point 5 formula:
$$ w_i = \log \frac{(r_i + 0.5) / (R - r_i + 0.5)}{(n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)} $$
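
The substitution and the 0.5 correction can be checked numerically. The following Python sketch is a minimal illustration, not the authors' implementation: the function names `estimate_probabilities`, `f4`, and `f4_point5` and the toy counts are assumptions made for this example.

```python
import math

def estimate_probabilities(N, R, n_i, r_i):
    """Estimate p_i and q_i from the contingency counts."""
    p_i = r_i / R
    q_i = (n_i - r_i) / (N - R)
    return p_i, q_i

def f4(N, R, n_i, r_i):
    """f4 weight; fails when any of the four cell counts is 0."""
    return math.log((r_i / (R - r_i)) / ((n_i - r_i) / (N - n_i - R + r_i)))

def f4_point5(N, R, n_i, r_i):
    """f4 point 5 weight: 0.5 is added to each of the four cell counts."""
    return math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                    ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))

# Toy counts: 1000 documents, 10 relevant, term in 50 documents, 5 of them relevant.
N, R, n_i, r_i = 1000, 10, 50, 5
p_i, q_i = estimate_probabilities(N, R, n_i, r_i)
log_odds = math.log(p_i * (1 - q_i) / (q_i * (1 - p_i)))

print(round(log_odds, 4), round(f4(N, R, n_i, r_i), 4))  # the two values agree
print(round(f4_point5(N, R, n_i, r_i), 4))               # close to the f4 value
print(round(f4_point5(N, R, n_i, 0), 4))                 # defined even when r_i is 0
```

With these counts the log-odds form and the f4 form give the same weight, while calling `f4` with `r_i = 0` would raise an error because the logarithm of 0 is undefined.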

N and $n_i$ are easily obtainable. However, since the set of relevant documents is unknown to the system, it is necessary to estimate R and $r_i$ for the initial query. There are a variety of estimation methods available [21, ] for initial queries, although the common technique is to use the f4 point 5 or a similar [36] formula and set R and $r_i$ to 0. The f4 point 5 weight then reduces to $\log \frac{N - n_i + 0.5}{n_i + 0.5}$, so each query term starts with a weight that depends only on collection statistics and not on any relevance information.

Relevance feedback is implemented in this system by updating R and $r_i$ after each iteration. The theory is that after enough iterations, the values of $p_i$ and $q_i$ will converge on their true values, and thus the ranking formula defined by Equation (4) will retrieve the full set of relevant documents.
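
One feedback iteration can be sketched as follows. This is a simplified Python illustration under stated assumptions, not the system described in the source: documents are represented as sets of terms, the helper `update_counts` and the toy counts are hypothetical, and relevance judgments are supplied by hand instead of coming from a user.

```python
import math

def f4_point5(N, R, n_i, r_i):
    """f4 point 5 weight: 0.5 is added to each of the four cell counts."""
    return math.log(((r_i + 0.5) / (R - r_i + 0.5)) /
                    ((n_i - r_i + 0.5) / (N - n_i - R + r_i + 0.5)))

def update_counts(R, r, judged_relevant_docs):
    """Return updated (R, r) after the user marks some documents as relevant.

    R: number of documents judged relevant so far
    r: dict mapping each query term to the number of those documents containing it
    judged_relevant_docs: iterable of term sets, one per newly judged document
    """
    r = dict(r)  # copy so the caller's counts are not mutated
    for doc_terms in judged_relevant_docs:
        R += 1
        for term in doc_terms:
            if term in r:
                r[term] += 1
    return R, r

N = 1000                                  # documents in the collection
n = {"probabilistic": 40, "model": 300}   # documents containing each query term

# Initial query: no relevance information yet.
R, r = 0, {term: 0 for term in n}
weights_before = {t: f4_point5(N, R, n[t], r[t]) for t in n}

# One feedback iteration: two documents are judged relevant.
R, r = update_counts(R, r, [{"probabilistic", "model"}, {"probabilistic"}])
weights_after = {t: f4_point5(N, R, n[t], r[t]) for t in n}

for t in n:
    print(t, round(weights_before[t], 3), round(weights_after[t], 3))
```

In this toy run the weight of the term that appears in both judged documents rises sharply, while the weight of the more common term barely changes.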



Erik Selberg
Wed Aug 6 12:24:17 PDT 1997