
I think TfidfVectorizer is not calculating the IDF factor correctly. For example, copying the code from tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer:

from sklearn.feature_extraction.text import TfidfVectorizer
corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as a weight, computing tf*idf
    norm=None,                # vector normalization (None = no normalization)
    smooth_idf=False,         # when True, adds 1 to N and ni => idf = ln((N+1)/(ni+1))
    sublinear_tf=False,       # when True, tf = 1 + ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None,
    stop_words=None, tokenizer=None, vocabulary=None
)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))

The Output is:

{u'is': 1.0,
 u'nice': 1.6931471805599454,
 u'strange': 1.6931471805599454,
 u'this': 1.0,
 u'very': 1.0}

But it should be:

{u'is': 0.0,
 u'nice': 0.6931471805599454,
 u'strange': 0.6931471805599454,
 u'this': 0.0,
 u'very': 0.0}

Isn't it? What am I doing wrong?

However, the calculation of IDF, according to http://www.tfidf.com/, is:

IDF(t) = log_e(Total number of documents / Number of documents with term t in it)

Thus, since the terms 'this', 'is' and 'very' appear in both sentences, IDF = log_e(2/2) = 0.

The terms 'strange' and 'nice' appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
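
For reference, the hand calculation can be checked with plain Python's math.log (the natural logarithm), independent of sklearn:

import math

n_docs = 2  # total number of documents
print(math.log(n_docs / 2.0))  # 'this', 'is', 'very': ln(2/2) = 0.0
print(math.log(n_docs / 1.0))  # 'strange', 'nice': ln(2/1) = 0.6931471805599453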

Priscilla Lusie
  • Hey Priscilla. I'm not a Python user, but can you clarify what you're trying to do and the problem you've encountered? You're more likely to get responses from expert users if they understand your exact goal, why you're trying to reach it, and why the output is wrong. Best of luck getting an answer, and welcome to Stack Overflow! – Dylan Knowles Apr 20 '16 at 22:39
  • I really need to understand what I can do to get the correct tf-idf values using this sklearn function, because the values it returns are wrong. – Priscilla Lusie Apr 28 '16 at 17:48

2 Answers


Two things are happening that you might not expect in the sklearn implementation:

  1. The TfidfTransformer has smooth_idf=True as a default parameter
  2. It always adds 1 to the weight

So with smooth_idf=False, as in your code, it is using:

idf = log(n_samples / df) + 1

where n_samples is the total number of documents and df is the number of documents containing the term. (With the default smooth_idf=True it would instead use idf = log((1 + n_samples) / (1 + df)) + 1.)

Here it is in the source:

https://github.com/scikit-learn/scikit-learn/blob/master/sklearn/feature_extraction/text.py#L987-L992
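
As a quick sanity check, a minimal sketch using only the document counts from the question reproduces the output you saw:

import numpy as np

n_samples = 2          # total number of documents in the corpus
df_is, df_nice = 2, 1  # document frequencies of 'is' and 'nice'

# smooth_idf=False, as in the question's settings:
print(np.log(float(n_samples) / df_is) + 1)    # 1.0
print(np.log(float(n_samples) / df_nice) + 1)  # 1.6931471805599454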

EDIT: You could subclass the TfidfTransformer that TfidfVectorizer uses internally and comment out the + 1 in its fit method, like this:

import scipy.sparse as sp
import numpy as np
from sklearn.feature_extraction.text import (TfidfTransformer,
                                             _document_frequency)

class PriscillasTfidfTransformer(TfidfTransformer):

    def fit(self, X, y=None):
        """Learn the idf vector (global term weights)
        Parameters
        ----------
        X : sparse matrix, [n_samples, n_features]
            a matrix of term/token counts
        """
        if not sp.issparse(X):
            X = sp.csc_matrix(X)
        if self.use_idf:
            n_samples, n_features = X.shape
            df = _document_frequency(X)

            # perform idf smoothing if required
            df += int(self.smooth_idf)
            n_samples += int(self.smooth_idf)

            # the stock implementation uses log(...) + 1 so that terms
            # with zero idf are not suppressed entirely; the + 1 is
            # commented out here to get the textbook idf = ln(N / df)
            idf = np.log(float(n_samples) / df)  # + 1.0
            self._idf_diag = sp.spdiags(idf, diags=0,
                                        m=n_features, n=n_features)

        return self
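
For example (a usage sketch; the class and variable names are just illustrative), you can pair it with a plain CountVectorizer to get the textbook values from the question:

from sklearn.feature_extraction.text import CountVectorizer

corpus = ["This is very strange",
          "This is very nice"]
count_vectorizer = CountVectorizer()
counts = count_vectorizer.fit_transform(corpus)

transformer = PriscillasTfidfTransformer(smooth_idf=False, norm=None)
transformer.fit(counts)
print(dict(zip(count_vectorizer.get_feature_names(), transformer.idf_)))
# {u'is': 0.0, u'nice': 0.6931..., u'strange': 0.6931..., u'this': 0.0, u'very': 0.0}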
zemekeneng

The actual formula they use to compute the idf (when smooth_idf is True) is:

idf = log((1 + n_samples) / (1 + df)) + 1

where n_samples is the total number of documents and df is the number of documents containing the term.

It's in the source, but I think the web documentation is a little ambiguous about it.

https://github.com/scikit-learn/scikit-learn/blob/14031f6/sklearn/feature_extraction/text.py#L966-L969
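
A quick check of that formula against the default vectorizer (a minimal sketch using the corpus from the question):

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer()  # smooth_idf=True by default
vectorizer.fit(corpus)

n_samples = 2  # total number of documents
df = 1         # 'nice' appears in one of the two documents
print(np.log((1.0 + n_samples) / (1 + df)) + 1)  # 1.4054651081081644
print(dict(zip(vectorizer.get_feature_names(), vectorizer.idf_)))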