
I think TfidfVectorizer is not calculating the IDF factor correctly. For example, copying the code from tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as a weight, computing tf*idf
    norm=None,                # vector normalization (None = no normalization)
    smooth_idf=False,         # when True, adds 1 to N and ni => idf = ln((N+1)/(ni+1))
    sublinear_tf=False,       # when True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None,
    stop_words=None, tokenizer=None, vocabulary=None
)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

The Output is:

{u'is': 1.0,
 u'nice': 1.6931471805599454,
 u'strange': 1.6931471805599454,
 u'this': 1.0,
 u'very': 1.0}

But it should be:

{u'is': 0.0,
 u'nice': 0.6931471805599454,
 u'strange': 0.6931471805599454,
 u'this': 0.0,
 u'very': 0.0}

Isn't it? What am I doing wrong?

However, the calculation of IDF, according to http://www.tfidf.com/, is:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

Thus, since the terms 'this', 'is' and 'very' appear in both sentences, IDF = log_e(2/2) = 0.

The terms 'strange' and 'nice' appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
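
For reference, the hand calculation can be checked with plain Python's math.log (the natural logarithm), independent of sklearn:

import math

n_docs = 2  # total number of documents
print(math.log(n_docs / 2.0))  # 'this', 'is', 'very': ln(2/2) = 0.0
print(math.log(n_docs / 1.0))  # 'strange', 'nice': ln(2/1) = 0.6931471805599453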

Priscilla Lusie
  • Hey Priscilla. I'm not a Python user, but can you clarify what you're trying to do and the problem you've encountered? You're more likely to get responses from expert users if they understand your exact goal, why you're trying to reach it, and why the output is wrong. Best of luck getting an answer, and welcome to Stack Overflow! – Dylan Knowles Apr 20 '16 at 22:39
  • I really need to understand what I can do to get the correct tf-idf values using this sklearn function, because the values it returns are wrong. – Priscilla Lusie Apr 28 '16 at 17:48

2 Answers


Two things are happening that you might not expect in the sklearn implementation:

  1. The TfidfTransformer has smooth_idf=True as a default parameter
  2. It always adds 1 to the weight

So with smooth_idf=False, as in your code, it is using:

idf = log(n_samples / df) + 1

where n_samples is the total number of documents and df is the number of documents containing the term. (With the default smooth_idf=True it would instead use idf = log((1 + n_samples) / (1 + df)) + 1.)

Here it is in the source:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992
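
As a quick sanity check, a minimal sketch using only the document counts from the question reproduces the output you saw:

import numpy as np

n_samples = 2          # total number of documents in the corpus
df_is, df_nice = 2, 1  # document frequencies of 'is' and 'nice'

# smooth_idf=False, as in the question's settings:
print(np.log(float(n_samples) / df_is) + 1)    # 1.0
print(np.log(float(n_samples) / df_nice) + 1)  # 1.6931471805599454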

EDIT: You could subclass the TfidfTransformer that TfidfVectorizer uses internally and comment out the + 1 in its fit method, like this:

import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfTransformer,
                                             _document_frequency)

class PriscillasTfidfTransformer(TfidfTransformer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)
        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # the stock implementation uses log(...) + 1 so that terms
            # with zero idf are not suppressed entirely; the + 1 is
            # commented out here to get the textbook idf = ln(N / df)
            idf = np.log(float(n_samples) / df)  # + 1.0
            self._idf_diag = sp.spdiags(idf, diags=0,
                                        m=n_features, n=n_features)

        return self
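
For example (a usage sketch; the class and variable names are just illustrative), you can pair it with a plain CountVectorizer to get the textbook values from the question:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is very strange",
          "This is very nice"]
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus)

transformer = PriscillasTfidfTransformer(smooth_idf=False, norm=None)
transformer.fit(counts)
print(dict(zip(count_vectorizer.get_feature_names(), transformer.idf_)))
# {u'is': 0.0, u'nice': 0.6931..., u'strange': 0.6931..., u'this': 0.0, u'very': 0.0}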
zemekeneng

The actual formula they use to compute the idf (when smooth_idf is True) is:

idf = log((1 + n_samples) / (1 + df)) + 1

where n_samples is the total number of documents and df is the number of documents containing the term.

It's in the source, but I think the web documentation is a little ambiguous about it.

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L966-L969
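
A quick check of that formula against the default vectorizer (a minimal sketch using the corpus from the question):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()  # smooth_idf=True by default
vectorizer.fit(corpus)

n_samples = 2  # total number of documents
df = 1         # 'nice' appears in one of the two documents
print(np.log((1.0 + n_samples) / (1 + df)) + 1)  # 1.4054651081081644
print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))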