
this page: http://scikit-learn.org/stable/modules/feature_extraction.html mentions:

As tf–idf is very often used for text features, there is also another class called TfidfVectorizer that combines all the options of CountVectorizer and TfidfTransformer in a single model.
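
For context, the docs mean that the following two routes produce the same matrix; a minimal sketch (toy corpus of my own):

import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer, TfidfTransformer, TfidfVectorizer)

corpus = ["This is very strange",
          "This is very nice"]

# Two-step route: raw term counts, then tf-idf weighting
counts = CountVectorizer().fit_transform(corpus)
tfidf_two_step = TfidfTransformer().fit_transform(counts)

# One-step route: TfidfVectorizer does both at once
tfidf_one_step = TfidfVectorizer().fit_transform(corpus)

print(np.allclose(tfidf_two_step.toarray(), tfidf_one_step.toarray()))  # True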

Then I followed the code and used fit_transform() on my corpus. How can I get the weight of each feature computed by fit_transform()?

I tried:

In [39]: vectorizer.idf_
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-39-5475eefe04c0> in <module>()
----> 1 vectorizer.idf_

AttributeError: 'TfidfVectorizer' object has no attribute 'idf_'

but this attribute is missing.

Thanks


2 Answers


Since version 0.15, the idf weight of each feature can be retrieved via the attribute idf_ of the TfidfVectorizer object:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(min_df=1)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

Output:

{'is': 1.0,
 'nice': 1.4054651081081644,
 'strange': 1.4054651081081644,
 'this': 1.0,
 'very': 1.0}

As discussed in the comments, prior to version 0.15 a workaround was to access idf_ via the supposedly hidden _tfidf attribute (an instance of TfidfTransformer) of the vectorizer:

idf = vectorizer._tfidf.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

which should give the same output as above.
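
Note that idf_ holds only the idf part of the weighting; the full tf-idf value of each term in each document is in the matrix X returned by fit_transform() above:

# Dense view of the tf-idf matrix: rows = documents, columns = features
print(X.toarray())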

  • This is a bug. Users shouldn't need to access leading `_` members. – Fred Foo May 22 '14 at 07:30
  • I see. Should ``TfidfVectorizer`` expose an ``idf`` attribute directly? Seems reasonable to have that. – YS-L May 22 '14 at 08:08
  • how to add stopwords in it? – Nurdin Apr 02 '16 at 05:39
  • 5
    @YS-L this is just the IDF score, correct, not the full TF-IDF ? – Felipe May 24 '17 at 05:52
  • I have a doubt about the calculation: since 'nice' or 'strange' each appear in only one of the two documents, shouldn't the idf equal 1 + ln(2) ≈ 1.69 rather than the 1.40 stated above? – Inherited Geek Jul 02 '17 at 12:26
  • Or if it's base 10, then 1 + log10(2) gives 1.30, not 1.40 – Inherited Geek Jul 02 '17 at 12:38
  • @InheritedGeek First, you have to multiply the two factors, not sum them. Second, the natural log (base e) applies in this case. Finally, the Euclidean (L2) norm is applied when you use TfidfVectorizer. For more information read: https://scikit-learn.org/stable/modules/feature_extraction.html#text-feature-extraction – Laura Dec 11 '19 at 14:13
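
For the record, the log base alone does not explain the 1.4054... value: scikit-learn's default smooth_idf=True computes idf = ln((1 + n) / (1 + df)) + 1 (per the current docs), which can be checked directly:

import math

n = 2  # number of documents in the corpus
print(math.log((1 + n) / (1 + 2)) + 1)  # df=2 ("this", "is", "very") -> 1.0
print(math.log((1 + n) / (1 + 1)) + 1)  # df=1 ("nice", "strange") -> 1.4054651081081644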

See also how to get the TF-IDF values of all the documents:

# X and vectorizer come from the snippet in the answer above
feature_names = vectorizer.get_feature_names()
doc = 0
feature_index = X[doc, :].nonzero()[1]
tfidf_scores = zip(feature_index, [X[doc, x] for x in feature_index])
for w, s in [(feature_names[i], s) for (i, s) in tfidf_scores]:
    print(w, s)

this 0.448320873199
is 0.448320873199
very 0.448320873199
strange 0.630099344518

#and for doc=1
this 0.448320873199
is 0.448320873199
very 0.448320873199
nice 0.630099344518
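
Alternatively (assuming pandas is available), the whole tf-idf matrix can be inspected at once, reusing X and vectorizer from above:

import pandas as pd

# Dense view of the tf-idf matrix; fine for small corpora only
df = pd.DataFrame(X.toarray(), columns=vectorizer.get_feature_names())
print(df)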

The results are indeed L2-normalized per document; the squared weights of each row sum to 1:

>>> 0.448320873199**2 + 0.448320873199**2 + 0.448320873199**2 + 0.630099344518**2
0.9999999999997548
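
This can be confirmed directly (assuming numpy is available): every row of X has unit L2 norm, since TfidfVectorizer uses norm='l2' by default:

import numpy as np

# L2 norm of each document row should be ~1.0
print(np.linalg.norm(X.toarray(), axis=1))  # [1. 1.]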
