
this page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
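
For context, the docs mean that the following two routes produce the same matrix; a minimal sketch (toy corpus of my own):

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ["This is very strange",
          "This is very nice"]

# Two-step route: raw term counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
tfidf_one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True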

Then I followed the code and used fit_transform() on my corpus. How can I get the weight of each feature computed by fit_transform()?

I tried:

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

but this attribute is missing.

Thanks


2 Answers


Since version 0.15, the idf weight of each feature can be retrieved via the attribute idf_ of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

Output:

{'is': 1.0,
 'nice': 1.4054651081081644,
 'strange': 1.4054651081081644,
 'this': 1.0,
 'very': 1.0}

As discussed in the comments, prior to version 0.15 a workaround was to access idf_ via the supposedly hidden _tfidf attribute (an instance of TfidfTransformer) of the vectorizer:

idf = vectorizer._tfidf.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

which should give the same output as above.
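
Note that idf_ holds only the idf part of the weighting; the full tf-idf value of each term in each document is in the matrix X returned by fit_transform() above:

# Dense view of the tf-idf matrix: rows = documents, columns = features
print(X.toarray())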

  • This is a bug. Users shouldn't need to access leading `_` members. – Fred Foo May 22 '14 at 07:30
  • I see. Should ``TfidfVectorizer`` expose an ``idf`` attribute directly? Seems reasonable to have that. – YS-L May 22 '14 at 08:08
  • how to add stopwords in it? – Nurdin Apr 02 '16 at 05:39
  • 5
    @YS-L this is just the IDF score, correct, not the full TF-IDF ? – Felipe May 24 '17 at 05:52
  • I have a doubt about the calculation: since 'nice' or 'strange' each appear in only one of the two documents, shouldn't the idf equal 1 + ln(2) ≈ 1.69 rather than the 1.40 stated above? – Inherited Geek Jul 02 '17 at 12:26
  • Or if it's base 10, then 1 + log10(2) gives 1.30, not 1.40 – Inherited Geek Jul 02 '17 at 12:38
  • @InheritedGeek First, you have to multiply the two factors, not sum them. Second, the natural log (base e) applies in this case. Finally, the Euclidean (L2) norm is applied when you use TfidfVectorizer. For more information read: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction – Laura Dec 11 '19 at 14:13
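
For the record, the log base alone does not explain the 1.4054... value: scikit-learn's default smooth_idf=True computes idf = ln((1 + n) / (1 + df)) + 1 (per the current docs), which can be checked directly:

import math

n = 2  # number of documents in the corpus
print(math.log((1 + n) / (1 + 2)) + 1)  # df=2 ("this", "is", "very") -> 1.0
print(math.log((1 + n) / (1 + 1)) + 1)  # df=1 ("nice", "strange") -> 1.4054651081081644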

See also how to get the TF-IDF values of all the documents:

# X and vectorizer come from the snippet in the answer above
feature_names = vectorizer.get_feature_names()
doc = 0
feature_index = X[doc, :].nonzero()[1]
tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)

this 0.448320873199
is 0.448320873199
very 0.448320873199
strange 0.630099344518

#and for doc=1
this 0.448320873199
is 0.448320873199
very 0.448320873199
nice 0.630099344518
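
Alternatively (assuming pandas is available), the whole tf-idf matrix can be inspected at once, reusing X and vectorizer from above:

import pandas as pd

# Dense view of the tf-idf matrix; fine for small corpora only
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
print(df)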

The results are indeed L2-normalized per document; the squared weights of each row sum to 1:

>>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
0.9999999999997548
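
This can be confirmed directly (assuming numpy is available): every row of X has unit L2 norm, since TfidfVectorizer uses norm='l2' by default:

import numpy as np

# L2 norm of each document row should be ~1.0
print(np.linalg.norm(X.toarray(), axis=1))  # [1. 1.]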
