16

How do I calculate tf-idf for a query? I understand how to calculate tf-idf for a set of documents with the following definitions:

tf = occurrences in document / total words in document

idf = log(#documents / #documents where term occurs)

But I don't understand how that correlates to queries.

For example, I read a resource that gave the following values for the query "life learning":

life | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576
learning | tf = .5 | idf = 1.405507153 | tf_idf = 0.702753576

I understand the tf values: each term appears once out of the two terms in the query, thus 1/2. But I have no idea where the idf comes from.
I would think that #documents = 1 and occurrence = 1, log(1) = 0, so idf would be 0, but this doesn't seem to be the case. Is it based on whatever documents you're using? How do you calculate tf-idf for a query?

Amir
Codarus

3 Answers

7

Assume your query is "best car insurance", your vocabulary consists of the terms car, best, auto, and insurance, and you have N = 1,000,000 documents. Your query then looks something like this:

[image not available: term weights for the query]

And one of your documents could be:

[image not available: term weights for the document]

Now calculate the cosine similarity between the TF-IDF vectors of your query and the document.
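The cosine similarity step can be sketched roughly as follows. This is only an illustration: the function name `cosine_similarity`, the dict-based vector layout, and the weights are assumptions made up for the example, not the values from the lecture notes.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two sparse vectors stored as {term: weight}."""
    # Only terms present in both vectors contribute to the dot product.
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an all-zero vector has no direction
    return dot / (norm_a * norm_b)

# Hypothetical TF-IDF weights, chosen only for illustration.
query_vec = {"best": 0.5, "car": 1.0, "insurance": 1.5}
doc_vec = {"auto": 1.0, "car": 1.0, "insurance": 3.0}

score = cosine_similarity(query_vec, doc_vec)
print(score)
```

Terms missing from a vector count as weight 0, so "best" and "auto" lower the score without contributing to the dot product.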

Amir
  • I find this example really confusing. I know this is taken from Stanford lecture notes and thus it should be valid but isn't idf made to be part of the "document part" like @hypnoticpoisons stated in his answer? From this example it looks like it's part of a "query part" which makes no sense to me. – Banik Apr 25 '22 at 17:19
6

Only tf(life) depends on the query itself. The idf of a query term, however, depends on the background documents: here, evidently 3 documents of which 2 contain "life", so idf(life) = 1 + ln(3/2) ~= 1.405507153. That is why tf-idf is defined as the product of a local component (term frequency) and a global component (inverse document frequency).
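As a rough sketch of that calculation, assume a background collection of three documents in which "life" and "learning" each occur in two. The corpus text and the helper names `idf` and `query_tf_idf` are made up for illustration; the formula is the 1 + ln(N/df) variant used above.

```python
import math

# Hypothetical background corpus: "life" and "learning" each occur in 2 of 3 documents.
corpus = [
    "the game of life is a game of everlasting learning",
    "the unexamined life is not worth living",
    "never stop learning",
]
docs = [set(text.split()) for text in corpus]

def idf(term):
    """Global component: 1 + ln(N / df), computed from the background documents only."""
    df = sum(1 for d in docs if term in d)
    if df == 0:
        return 0.0  # assumption: terms unseen in the corpus get zero weight
    return 1.0 + math.log(len(docs) / df)

def query_tf_idf(query):
    """Local component (tf within the query) times the global idf."""
    words = query.split()
    return {t: (words.count(t) / len(words)) * idf(t) for t in set(words)}

weights = query_tf_idf("life learning")
print(weights)
```

With a different query such as "british chunnel impact", each tf would be 1/3, and each idf would depend on how many background documents contain that term, so it would generally not stay at 1 + ln(3/2).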

hypnoticpoisons
  • How do i calculate idf based on background documents? Eg: query="british chunnel impact", then tf would be 1/3 but will idf remain 1+log(3/2)~=1.405507153 or will it change? – Salik Malik Nov 25 '20 at 15:59
3

Even though this question is marked as answered, I don't feel it was fully answered. So, in case anyone needs this in the future:

"But I have no idea where the idf comes from."

The example "Project 3, part 2: Searching using TF-IDF" shows how to compute the cosine similarity between a query and a set of documents.

As @hypnoticpoisons stated, the IDF is a global component, so the IDF of a word is the same for every document:

Note: technically, we are treating the query as if it were a new document. However, you should not recompute the IDF values: just use the ones you computed earlier.
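One way to picture that note is a single vectorizing function that takes the IDF table as a fixed input, so queries and documents go through the same path and nothing is recomputed. The name `vectorize` and the IDF values below are assumptions for illustration, not taken from the project.

```python
from collections import Counter

def vectorize(text, idf):
    """TF-IDF vector for any text (query or document) using a fixed IDF table."""
    words = text.lower().split()
    counts = Counter(words)
    # Terms missing from the IDF table were never seen in the collection; skip them.
    return {t: (c / len(words)) * idf[t] for t, c in counts.items() if t in idf}

# Hypothetical IDF table, computed once from the document collection.
idf_table = {"life": 1.4055, "learning": 1.4055, "game": 2.0986}

query_vec = vectorize("life learning", idf_table)
print(query_vec)
```

Skipping out-of-vocabulary query terms is a design choice in this sketch; they cannot match any document in the collection anyway.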

Bakmy