I tried to calculate TF-IDF values manually using the formula, but my result differs from what the scikit-learn implementation returns.

from sklearn.feature_extraction.text import TfidfVectorizer

tv = TfidfVectorizer()

a = "cat hat bat splat cat bat hat mat cat"
b = "cat mat cat sat"

tv.fit_transform([a, b]).toarray()

# array([[0.53333448, 0.56920781, 0.53333448, 0.18973594, 0.        ,
#             0.26666724],
#            [0.        , 0.75726441, 0.        , 0.37863221, 0.53215436,
#             0.        ]])

tv.get_feature_names()  # in scikit-learn >= 1.0, use tv.get_feature_names_out()
# ['bat', 'cat', 'hat', 'mat', 'sat', 'splat']

I calculated the TF-IDF values for the first document by hand, but the result differs from the first row returned by TfidfVectorizer.fit_transform:

(np.log(2+1/1+1) + 1) * (2/9) = 0.5302876358044202
(np.log(2+1/2+1) + 1) * (3/9) = 0.750920989498456
(np.log(2+1/1+1) + 1) * (2/9) = 0.5302876358044202
(np.log(2+1/2+1) + 1) * (1/9) = 0.25030699649948535
(np.log(2+1/1+1) + 1) * (0/9) = 0.0
(np.log(2+1/1+1) + 1) * (1/9) = 0.2651438179022101

What I should have got (the first row of the array above) is:

[0.53333448, 0.56920781, 0.53333448, 0.18973594, 0, 0.26666724]
Jeeth

1 Answer

There are many variations of TF-IDF. With the default smooth_idf=True, sklearn first computes, for each term t in document d:

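count_of_term_t_in_d * (log((number_of_documents + 1) / (number_of_documents_where_t_appears + 1)) + 1)

Here log is the natural logarithm. For the six vocabulary terms of the first document (2 documents in total):
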
2 * (np.log((1 + 2)/(1+1)) + 1) = 2.8109302162163288   # bat
3 * (np.log((1 + 2)/(2+1)) + 1) = 3.0                  # cat
2 * (np.log((1 + 2)/(1+1)) + 1) = 2.8109302162163288   # hat
1 * (np.log((1 + 2)/(2+1)) + 1) = 1.0                  # mat
0 * (np.log((1 + 2)/(1+1)) + 1) = 0.0                  # sat (appears only in the second document)
1 * (np.log((1 + 2)/(1+1)) + 1) = 1.4054651081081644   # splat
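
The same unnormalized weights can be computed for the whole corpus in a few lines. This is only a minimal sketch using the standard library, not sklearn's actual code: the helper name `raw_tfidf` is made up for illustration, and it assumes plain whitespace tokenization, which happens to give the same tokens as the vectorizer's defaults for these two sentences.

```python
import math
from collections import Counter

docs = [
    "cat hat bat splat cat bat hat mat cat",
    "cat mat cat sat",
]

def raw_tfidf(docs):
    """Hypothetical helper: raw count * smoothed idf, before any normalization."""
    # Assumption: whitespace tokenization is good enough for this toy corpus.
    tokenized = [d.split() for d in docs]
    vocab = sorted({t for toks in tokenized for t in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain the term.
    df = {t: sum(t in toks for toks in tokenized) for t in vocab}
    # Smoothed idf: log((n + 1) / (df + 1)) + 1, natural logarithm.
    idf = {t: math.log((n_docs + 1) / (df[t] + 1)) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        rows.append([counts[t] * idf[t] for t in vocab])
    return vocab, rows

vocab, rows = raw_tfidf(docs)
print(vocab)    # ['bat', 'cat', 'hat', 'mat', 'sat', 'splat']
print(rows[0])  # unnormalized weights of the first document
```

Note that the term frequency here is the raw count, not the count divided by the document length.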

After the calculation, the final TFIDF vector is normalized by the Euclidean norm:

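import numpy as np

# Use a NumPy array; dividing a plain Python list by a float raises a TypeError
tfidf_vector = np.array([2.8109302162163288, 3.0, 2.8109302162163288, 1.0, 0.0, 1.4054651081081644])

# Divide by the Euclidean (L2) norm so the vector has unit length
tfidf_vector = tfidf_vector / np.linalg.norm(tfidf_vector)

print(tfidf_vector)
# [0.53333448 0.56920781 0.53333448 0.18973594 0.         0.26666724]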
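
As for the numbers in the question, two things appear to explain the gap. First, Python evaluates `2+1/1+1` as `2 + (1/1) + 1`, so the logarithm is taken of 4 rather than of 3/2. Second, the normalization step is missing. Dividing the counts by the document length (9) is harmless on its own, because a constant factor cancels out in the L2 normalization. The sketch below uses only the standard library; the variable names are mine, chosen for illustration.

```python
import math

# The expression as written in the question: 2 + (1/1) + 1
as_written = 2 + 1 / 1 + 1       # 4.0
# The ratio the smoothed idf needs: (n_docs + 1) / (df + 1)
as_intended = (2 + 1) / (1 + 1)  # 1.5

idf_rare = math.log(as_intended) + 1            # terms found in one document
idf_common = math.log((2 + 1) / (2 + 1)) + 1    # terms found in both documents: 1.0

# Relative term frequencies (count / 9) as in the question,
# in the order bat, cat, hat, mat, sat, splat.
weights = [
    (2 / 9) * idf_rare,
    (3 / 9) * idf_common,
    (2 / 9) * idf_rare,
    (1 / 9) * idf_common,
    (0 / 9) * idf_rare,
    (1 / 9) * idf_rare,
]

# L2 normalization: divide by the square root of the sum of squares.
norm = math.sqrt(sum(w * w for w in weights))
normalized = [w / norm for w in weights]
print([round(w, 8) for w in normalized])
```

With the parentheses fixed and the normalization applied, the values agree with the first row of the sklearn output to the precision shown there.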
Eduardo Soares
  • np.linalg.norm calculates the Euclidean norm of a vector. The Euclidean norm is defined as the square root of the sum of the squares of the components: `np.sqrt(np.sum(tfidf_vector ** 2))` – Eduardo Soares Feb 18 '19 at 22:46
  • Got it, thanks. Do you know anything about `TfidfTransformer` and how it differs from `TfidfVectorizer`? It's another [question](https://stackoverflow.com/questions/54745482/what-is-the-difference-between-tfidf-vectorizer-and-tfidf-transformer/54748136#54748136) I've asked about this. – Jeeth Feb 18 '19 at 23:00