I think the TfidfVectorizer function is not calculating the IDF factor correctly. For example, copying the code from tf-idf feature weights using sklearn.feature_extraction.text.TfidfVectorizer:
from sklearn.feature_extraction.text import TfidfVectorizer

corpus = ["This is very strange",
          "This is very nice"]
vectorizer = TfidfVectorizer(
    use_idf=True,             # use idf as a weight, computing tf*idf
    norm=None,                # do not normalize the vectors
    smooth_idf=False,         # add 1 to N and ni => idf = ln((N+1)/(ni+1))
    sublinear_tf=False,       # tf = 1+ln(tf)
    binary=False,
    min_df=1, max_df=1.0, max_features=None,
    strip_accents='unicode',  # strip accents
    ngram_range=(1, 1), preprocessor=None, stop_words=None, tokenizer=None, vocabulary=None
)
X = vectorizer.fit_transform(corpus)
idf = vectorizer.idf_
print(dict(zip(vectorizer.get_feature_names(), idf)))
The Output is:
{u'is': 1.0,
u'nice': 1.6931471805599454,
u'strange': 1.6931471805599454,
u'this': 1.0,
u'very': 1.0}
But it should be:
{u'is': 0.0,
u'nice': 0.6931471805599454,
u'strange': 0.6931471805599454,
u'this': 0.0,
u'very': 0.0}
Shouldn't it? What am I doing wrong?
The calculation of IDF, according to http://www.tfidf.com/, is:
IDF(t) = log_e(Total number of documents / Number of documents with term t in it)
Thus, as the terms 'this', 'is' and 'very' appear in both sentences, IDF = log_e(2/2) = 0.
The terms 'strange' and 'nice' each appear in only one of the two documents, so IDF = log_e(2/1) = 0.69314.
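To check this arithmetic (and the constant offset between the two outputs), here is a minimal sketch. Note that the scikit-learn documentation defines its IDF, with smooth_idf=False, as idf(t) = ln(N/ni) + 1 rather than the textbook ln(N/ni), which would account for the values being exactly 1 higher than expected:

```python
import math

N = 2  # total number of documents in the corpus

# Textbook IDF: log_e(N / ni), where ni = number of documents containing term t
idf_common = math.log(N / 2)  # 'this', 'is', 'very' appear in both documents
idf_rare = math.log(N / 1)    # 'strange', 'nice' appear in one document

print(idf_common)  # 0.0
print(idf_rare)    # 0.6931471805599453

# scikit-learn adds 1 to the result (idf = log_e(N/ni) + 1),
# which matches the observed output of 1.0 and 1.6931...
print(idf_common + 1, idf_rare + 1)
```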