Luhn’s Point of View: Median-Based Term Weighting Schemes
Keywords:
Information retrieval, indexing, term importance
Abstract
In this study we replace the TF component of the TFxIDF term weighting method with a parameter derived from Luhn’s claim on term
importance. Luhn claims that words with mid-range frequencies are the most important, and that the importance of a word falls as its frequency
moves away from this range in either direction. We take the median frequency of the words in a document as the base and assess the importance of a word by the
difference between its frequency and the median frequency. The weighting functions are varied by two normalization approaches, one using the median
itself and one using the standard deviation of medians, and are tested on the TREC-6 through TREC-8 ad hoc tracks. In the experiments, the weightings
normalized by the median itself achieve better retrieval performance than basic TFxIDF and BM25 with respect to the MAP and R-P measures.
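
The idea can be illustrated with a minimal sketch. The abstract does not give the exact weighting function, so everything specific here is an assumption made for illustration: the function name `median_weights`, the decay form `1 / (1 + distance)`, the choice of the absolute difference divided by the median as the normalized distance, and the plain logarithmic IDF. It shows only the general shape of a median-normalized replacement for TF, not the weighting evaluated in the study.

```python
import math
from collections import Counter
from statistics import median


def median_weights(doc_tokens, doc_freq, n_docs):
    """Hypothetical median-based term weights for one document.

    doc_tokens: list of terms in the document
    doc_freq:   dict mapping term -> number of documents containing it
    n_docs:     total number of documents in the collection
    """
    counts = Counter(doc_tokens)
    if not counts:
        return {}

    # Median of the within-document term frequencies (always >= 1 here).
    med = median(counts.values())

    weights = {}
    for term, tf in counts.items():
        # Assumed normalization: distance from the median, divided by the median.
        distance = abs(tf - med) / med
        # Assumed decay: importance is 1 at the median and falls on both sides.
        importance = 1.0 / (1.0 + distance)
        # Plain logarithmic IDF, used here as a stand-in.
        idf = math.log(n_docs / doc_freq[term])
        weights[term] = importance * idf
    return weights


doc = ["a"] * 5 + ["b"] * 3 + ["c"]
df = {"a": 10, "b": 10, "c": 10}
w = median_weights(doc, df, n_docs=100)
```

In this toy document the term frequencies are 5, 3, and 1, so the median is 3. With equal document frequencies, the term at the median (`b`) receives the highest weight, while the most frequent and least frequent terms (`a` and `c`) are penalized equally.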