
I'm a beginner with the vector space model (VSM), and I tried the code from this site. It's a very good introduction to VSM, but I somehow got different results from the author. It might be a compatibility problem, since scikit-learn seems to have changed a lot since the introduction was written, or it might be that I misunderstood the explanation.
I used the code below and got the wrong answer. Can someone figure out what is wrong with it? I post the output of my code and the right answer below.

I have done the computation by hand, so I know that the results on the website are correct. There is another Stack Overflow question that uses the same code, but it doesn't get the same results as the website either.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

# Learn the vocabulary from the training set
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(train_set)

# Term-count matrix of the test set over the training vocabulary
smatrix = vectorizer.transform(test_set)
# print smatrix.todense()

# Fit the IDF weights on the test counts and print them
tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
tfidf.fit(smatrix)
print tfidf.idf_

tf_idf_matrix = tfidf.transform(smatrix)
print tf_idf_matrix.todense()

The IDF vector I get:
# [ 2.09861229  1.          1.40546511  1.        ]

The right IDF vector (from the website):
# [ 0.69314718, -0.40546511, -0.40546511,  0. ]

The tf-idf matrix I get:
# [[ 0.          0.50154891  0.70490949  0.50154891]
#  [ 0.          0.50854232  0.          0.861037  ]]

The right answer:
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]


1 Answer


It's not your fault; it's because the formula used by the current version of sklearn is different from the one used in the tutorial.

The current version of sklearn uses this formula (source):

idf = log(n_samples / df) + 1

where n_samples refers to the total number of documents (|D| in the tutorial) and df refers to the number of documents in which the term appears (|{d : t_i \in d}| in the tutorial).

To deal with zero division, they use smoothing by default (the option smooth_idf=True in TfidfTransformer, see the documentation), which changes the df and n_samples values like this, so that both are at least 1:

df += 1
n_samples += 1

While the one in the tutorial uses this formula:

idf = log(n_samples / (1 + df))

So you can't get exactly the same result as the tutorial unless you change the formula in the source code.
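To make the difference concrete, here is a quick numeric check of my own (not part of the original answer), using the document frequencies of blue, sun, bright, sky over the two test documents in the question (the tutorial's vocabulary order):

import numpy as np

df = np.array([0, 2, 2, 1])  # blue, sun, bright, sky over the 2 test docs
n_samples = 2

# Current sklearn: smooth first (df += 1, n_samples += 1), then log(n/df) + 1
print(np.log((n_samples + 1.0) / (df + 1.0)) + 1)
# [ 2.09861229  1.          1.          1.40546511]
# (same values as the question's idf_ output, modulo term order)

# Tutorial: log(n_samples / (1 + df)), no smoothing of n_samples
print(np.log(n_samples / (1.0 + df)))
# [ 0.69314718 -0.40546511 -0.40546511  0.        ]
# (exactly the tutorial's IDF vector)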

Edit:

Strictly speaking, the "right" formula is log(n_samples/df), but since it causes a zero-division problem in practice, people modify it so that it can be used in all cases. The most common variant is the one you mentioned, log(n_samples/(1+df)), but it's also not wrong to use log(n_samples/df)+1, given that the values have already been smoothed beforehand. Reading the code history, it seems they did it that way to avoid negative IDF values (as discussed in this pull request and later updated in this fix). Another way to remove negative IDF values is simply to clip them to 0. I have yet to find which method is more commonly used.
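As a small illustration of that last option (my sketch, not from the linked discussion), clipping simply zeroes out the negative entries of the tutorial-style IDF vector:

import numpy as np

idf = np.log(2 / (1.0 + np.array([0, 2, 2, 1])))  # tutorial formula, df as above
print(np.maximum(idf, 0))
# [ 0.69314718  0.          0.          0.        ]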

They did agree that the way they do it is not the standard way. So you can safely say that log(n_samples/(1+df)) is the correct way.

If you want to edit the formula, first I must warn you that this will affect every user of that scikit-learn installation; make sure you know what you're doing.

You can just go to the source code (on Unix it's at /usr/local/lib/python2.7/dist-packages/sklearn/feature_extraction/text.py; on Windows, search for the file "text.py") and edit the formula directly. You might need administrator/root access, depending on the platform you use.

Additional note:

The order of terms in the vocabulary is also different (at least on my machine), so to get exactly the same result (once the formula matches), you also need to pass in the same vocabulary as shown in the tutorial. Using your code:

vocabulary = {'blue':0, 'sun':1, 'bright':2, 'sky':3}
vectorizer = CountVectorizer(vocabulary=vocabulary) # You don't need stop_words if you use vocabulary
vectorizer.fit_transform(train_set)
print 'Vocabulary:', vectorizer.vocabulary_
# Vocabulary: {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
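Putting the pieces together, here is a sketch of my own (not from the original post) that reproduces the tutorial's exact matrix without patching sklearn, by applying the tutorial's formula to the counts manually:

import numpy as np
from sklearn.feature_extraction.text import CountVectorizer

test_set = ("The sun is the sky is bright.",
            "We can see the shining sun, the bright sun.")

# With an explicit vocabulary, no fitting is needed before transform
vocabulary = {'blue': 0, 'sun': 1, 'bright': 2, 'sky': 3}
vectorizer = CountVectorizer(vocabulary=vocabulary)
counts = vectorizer.transform(test_set).toarray().astype(float)

n_samples = counts.shape[0]           # |D| = 2 test documents
df = (counts > 0).sum(axis=0)         # document frequency of each term
idf = np.log(n_samples / (1.0 + df))  # the tutorial's formula

weighted = counts * idf               # raw term counts times IDF
norms = np.sqrt((weighted ** 2).sum(axis=1, keepdims=True))
print(weighted / norms)               # L2-normalized rows
# [[ 0.         -0.70710678 -0.70710678  0.        ]
#  [ 0.         -0.89442719 -0.4472136   0.        ]]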
  • Thanks. It seems that the formula in the code is not the right one, isn't it? I'm a beginner in this field. How can I edit the function if it is not right? – DJJ Sep 09 '13 at 08:33
  • But I think the right formula would be idf = log(n_samples / (1+df)). Can you confirm please? – DJJ Sep 09 '13 at 19:32
  • The formula in the code might not be "the right one", but in practice it works well to extract good features for machine learning and cosine similarity. @DJJ Are you sure you want to change the code in scikit-learn? Doesn't it yield the same similarity rankings as in the blog post? – ogrisel Sep 10 '13 at 10:06
  • We don't post comments just to say thanks on Stack Overflow, but you really made my day. Many thanks – DJJ Sep 10 '13 at 12:56
  • Thanks @ogrisel for your concern. As you can see, these methods are not very familiar to me. What I can say is that I understand the code of the previous version well. Do you have any reference about the new one? – DJJ Sep 12 '13 at 17:21
  • References are almost always included in the doc (http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.TfidfTransformer.html#sklearn.feature_extraction.text.TfidfTransformer). I am not 100% sure we follow those references exactly. It's just that the last time I tried to tweak the smoothing parameters to make them more standard, it hurt the performance of subsequent clustering or classification algorithms, so I decided to leave it as it is now. – ogrisel Sep 13 '13 at 08:46
  • I agree it might be considered a "bug", but to fix it someone needs to take the time to reread the literature, check exactly how we differ from it, and then write a bunch of evaluation scripts/tests on various datasets and tasks to evaluate the impact of a change in the smoothing params; so far nobody has. – ogrisel Sep 13 '13 at 08:49
  • You might also be interested in this answer to a related question: http://stackoverflow.com/a/12128777/163740 – ogrisel Sep 13 '13 at 08:52
  • My main point is that TF-IDF is fundamentally just a hack to weight text features so that machine learning and cosine-similarity queries work well on them. There is no "universally best" way to do TF-IDF. The "right" variant is the one that works best as measured by some specific metric (e.g. F1-score) for a specific machine learning task (e.g. multi-class classification). – ogrisel Sep 13 '13 at 08:55
  • This is definitely useful information, and you are right that I may have rushed to a conclusion too quickly. I will surely increase my knowledge of these methods soon. Many thanks to you – DJJ Sep 13 '13 at 14:36