I'm a beginner with the vector space model (VSM), and I tried the code from this site. It's a very good introduction to VSM, but I somehow get different results from the author's. This might be a compatibility problem, since scikit-learn seems to have changed a lot since the introduction was written, or I may have misunderstood the explanation.
I used the code below and got what I believe is the wrong answer. Can someone figure out what is wrong with it? The output of my code and the expected answer are posted after the code.
I have done the computation by hand, so I know that the website's results are correct.
There is another Stack Overflow question that uses the same code, but it doesn't get the same results as the website either.
import numpy, scipy, sklearn

train_set = ("The sky is blue.", "The sun is bright.")
test_set = ("The sun is the sky is bright.", "We can see the shining sun, the bright sun.")

# Build the vocabulary from the training set, then count terms in the test set
from sklearn.feature_extraction.text import CountVectorizer
vectorizer = CountVectorizer(stop_words='english')
vectorizer.fit_transform(train_set)
smatrix = vectorizer.transform(test_set)

# Fit the idf weights on the test-set counts, then apply them
from sklearn.feature_extraction.text import TfidfTransformer
tfidf = TfidfTransformer(norm='l2', sublinear_tf=True)
tfidf.fit(smatrix)
# print(smatrix.todense())
print(tfidf.idf_)
tf_idf_matrix = tfidf.transform(smatrix)
print(tf_idf_matrix.todense())
Resulting idf vector (tfidf.idf_):
# [ 2.09861229 1. 1.40546511 1. ]
Expected idf vector (from the website):
# [0.69314718, -0.40546511, -0.40546511, 0]
Resulting tf_idf_matrix:
# [[ 0. 0.50154891 0.70490949 0.50154891]
# [ 0. 0.50854232 0. 0.861037 ]]
Expected tf_idf_matrix (from the website):
# [[ 0. -0.70710678 -0.70710678 0. ]
# [ 0. -0.89442719 -0.4472136 0. ]]
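
For reference, the hand computation mentioned above can be sketched as follows. This is a minimal sketch, not the website's code. It assumes idf = log(n_docs / (1 + df)) applied to raw term counts, followed by L2 normalization, and it assumes the website's columns are ordered blue, sun, bright, sky (not the alphabetical order that scikit-learn uses). Both assumptions are inferred from the expected numbers, and the variable names are my own.

```python
import numpy as np

# Term counts for the two test documents.
# Assumed column order (as on the website): blue, sun, bright, sky.
counts = np.array([[0, 1, 1, 1],
                   [0, 2, 1, 0]], dtype=float)

n_docs = counts.shape[0]
df = (counts > 0).sum(axis=0)  # number of documents containing each term

# Assumed formula: idf = log(n_docs / (1 + df)), with no extra smoothing or offset.
idf = np.log(n_docs / (1.0 + df))

# Weight the raw counts by idf, then L2-normalize each row.
tfidf = counts * idf
tfidf_l2 = tfidf / np.linalg.norm(tfidf, axis=1, keepdims=True)

print(idf)
print(tfidf_l2)
```

Under those assumptions the sketch reproduces the expected idf vector and the expected matrix listed above, so the term counts themselves do not seem to be the source of the difference.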