Given a corpus of 3 documents, for example:

   sentences = ["This car is fast",
                "This car is pretty",
                "Very fast truck"]

I am calculating tf-idf by hand.

For document 1, and the word "car", I can find that:

TF = 1/4
IDF = log(3/2)
TF-IDF = 1/4 * log(3/2)

The same result should apply to document 2, since it also has 4 words and one of them is "car".
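
A minimal sketch of the hand calculation above, in plain Python. It assumes the textbook definition (relative term frequency and an unsmoothed IDF), a natural logarithm, and whitespace tokenization after lowercasing; the helper names `tf` and `idf` are made up for illustration and are not part of any library.

```python
import math

sentences = ["This car is fast",
             "This car is pretty",
             "Very fast truck"]

# Assumption: lowercase and split on whitespace (enough for this small corpus).
docs = [s.lower().split() for s in sentences]

def tf(term, doc):
    # Relative frequency of the term within a single document.
    return doc.count(term) / len(doc)

def idf(term, docs):
    # Unsmoothed IDF: log(N / number of documents containing the term).
    df = sum(term in doc for doc in docs)
    return math.log(len(docs) / df)

# Documents 1 and 2 both have 4 words and contain "car" once,
# so under this definition they get the same score.
for i, doc in enumerate(docs[:2], start=1):
    score = tf("car", doc) * idf("car", docs)
    print(f"document {i}: tf-idf('car') = {score:.6f}")
```

Under these assumptions both documents get the same value for "car", namely 1/4 * log(3/2).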

I have tried to apply this in sklearn, with the code below:

from sklearn.feature_extraction.text import TfidfVectorizer
import pandas as pd

data = {'text': sentences}
df = pd.DataFrame(data)
tv = TfidfVectorizer()
tfvector = tv.fit_transform(df.text)
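# get_feature_names_out() requires scikit-learn >= 1.0; older releases used get_feature_names()
print(pd.DataFrame(tfvector.toarray(), columns=tv.get_feature_names_out()))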

And the result I get is:

        car     fast        is    pretty      this     truck      very
0  0.500000  0.50000  0.500000  0.000000  0.500000  0.000000  0.000000
1  0.459854  0.00000  0.459854  0.604652  0.459854  0.000000  0.000000
2  0.000000  0.47363  0.000000  0.000000  0.000000  0.622766  0.622766

I understand that sklearn uses L2 normalization, but still, shouldn't the tf-idf score of "car" be the same in the first two documents? Can anyone help me understand the results?

XuUserAC
  • I think that, before the L2 normalization, the score for "car" is the same in both documents. But the total score in the first row is lower than in the second row (because the second row contains the high-IDF word 'pretty'), so the normalization affects the two rows differently, hence the different values. – Kevin Winata Jul 06 '19 at 09:28
  • Oh, I think I get it. I misinterpreted what I read about the L2 normalization, I guess. Thanks a lot for your answer! – XuUserAC Jul 06 '19 at 09:31
  • Possible duplicate of [Why is the value of TF-IDF different from IDF\_?](https://stackoverflow.com/questions/56653159/why-is-the-value-of-tf-idf-different-from-idf) – Venkatachalam Jul 08 '19 at 05:39
  • [this](https://stackoverflow.com/questions/53920770/how-is-the-tf-idf-value-calculated-with-analyzer-char/53923657#53923657) could also help – Venkatachalam Jul 08 '19 at 05:43

1 Answer

It is because of the normalization. If you pass norm=None, i.e. TfidfVectorizer(norm=None), you will get the following result, in which "car" has the same value in both documents:

        car      fast        is    pretty      this     truck      very
0  1.287682  1.287682  1.287682  0.000000  1.287682  0.000000  0.000000
1  1.287682  0.000000  1.287682  1.693147  1.287682  0.000000  0.000000
2  0.000000  1.287682  0.000000  0.000000  0.000000  1.693147  1.693147
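
To see where these numbers come from, here is a rough sketch that recomputes them without sklearn. It assumes the defaults described in the scikit-learn documentation (raw term counts, smoothed IDF of ln((1 + n) / (1 + df)) + 1, then L2 normalization per row) and plain whitespace tokenization, which happens to match the default tokenizer on this corpus. The names `smooth_idf` and `l2_normalize` are illustrative helpers, not sklearn functions.

```python
import math

sentences = ["This car is fast",
             "This car is pretty",
             "Very fast truck"]

# Assumption: lowercase + whitespace split is equivalent to the default
# tokenizer here, since every word has at least two characters.
docs = [s.lower().split() for s in sentences]
vocab = sorted({word for doc in docs for word in doc})
n = len(docs)

def smooth_idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    df = sum(term in doc for doc in docs)
    return math.log((1 + n) / (1 + df)) + 1

# Unnormalized scores (what norm=None shows): raw count * smoothed IDF.
raw = [[doc.count(term) * smooth_idf(term) for term in vocab] for doc in docs]

def l2_normalize(row):
    # Divide every entry by the Euclidean length of the row.
    norm = math.sqrt(sum(v * v for v in row))
    return [v / norm for v in row]

# Normalized scores (what the default norm='l2' shows).
normalized = [l2_normalize(row) for row in raw]

car = vocab.index("car")
for i in range(2):
    print(f"doc {i}: raw={raw[i][car]:.6f}  l2-normalized={normalized[i][car]:.6f}")
```

The raw score for "car" is identical in documents 0 and 1. The second row also contains 'pretty', which has a higher IDF, so that row has a larger Euclidean length and every entry in it, including "car", is divided by a larger number.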
Kevin Winata