13

As the title states: is CountVectorizer the same as TfidfVectorizer with use_idf=False? If not, why not?

So does this also mean that adding the TfidfTransformer here is redundant?

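from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# corpus: an iterable of document strings, defined elsewhere
vect = CountVectorizer(min_df=1)
tweets_vector = vect.fit_transform(corpus)  # raw term counts
tf_transformer = TfidfTransformer(use_idf=False).fit(tweets_vector)
tweets_vector_tf = tf_transformer.transform(tweets_vector)  # normalized term frequencies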
Olivier_s_j
  • For a detailed explanation, see https://manjunathhiremathm.wixsite.com/portfolio/blog-1/countvectorizer-v-s-tfidfvector – manju h Oct 24 '17 at 10:18

2 Answers

33

No, they're not the same. TfidfVectorizer normalizes its results by default, i.e. each row vector in its output has L2 norm 1:

>>> from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer
>>> CountVectorizer().fit_transform(["foo bar baz", "foo bar quux"]).toarray()
array([[1, 1, 1, 0],
       [1, 0, 1, 1]])
>>> TfidfVectorizer(use_idf=False).fit_transform(["foo bar baz", "foo bar quux"]).toarray()
array([[ 0.57735027,  0.57735027,  0.57735027,  0.        ],
       [ 0.57735027,  0.        ,  0.57735027,  0.57735027]])

This is done so that dot products between rows are cosine similarities. TfidfVectorizer can also use logarithmically discounted frequencies (1 + log(tf) in place of tf) when given the option sublinear_tf=True.
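
Both points can be checked with a minimal sketch, assuming a recent scikit-learn and NumPy. The two-document corpus is the one used above; the repeated-word document for sublinear_tf is made up for illustration.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]

# Rows of the use_idf=False output have unit L2 norm by default (norm="l2").
tf = TfidfVectorizer(use_idf=False).fit_transform(docs).toarray()
row_norms = np.linalg.norm(tf, axis=1)

# Because the rows are unit vectors, a plain dot product is the cosine similarity.
dot = tf[0] @ tf[1]

# Cosine similarity computed by hand from the raw counts, for comparison.
counts = CountVectorizer().fit_transform(docs).toarray().astype(float)
cosine = counts[0] @ counts[1] / (
    np.linalg.norm(counts[0]) * np.linalg.norm(counts[1])
)

# sublinear_tf=True replaces each nonzero count tf with 1 + log(tf).
# norm=None is used here so the discounted values are visible unnormalized.
sub = TfidfVectorizer(use_idf=False, norm=None, sublinear_tf=True)
discounted = sub.fit_transform(["foo foo foo bar"]).toarray()
```

Here `dot` and `cosine` should agree, and the count of 3 for "foo" should come out as 1 + log(3) in `discounted`.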

To make TfidfVectorizer behave as CountVectorizer, give it the constructor options use_idf=False, norm=None.
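
As a rough check of that equivalence, here is a sketch assuming a recent scikit-learn, where the keyword is spelled `norm`. The variable names are illustrative only. One difference that remains is the dtype: the values match, but TfidfVectorizer returns floats while CountVectorizer returns integers.

```python
import numpy as np
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo bar baz", "foo bar quux"]

counts = CountVectorizer().fit_transform(docs)
raw_tf = TfidfVectorizer(use_idf=False, norm=None).fit_transform(docs)

# The entries are identical...
same_values = np.array_equal(counts.toarray(), raw_tf.toarray())

# ...but the dtypes differ: integer counts versus floating-point tf values.
count_kind = counts.dtype.kind  # "i" for integer
tf_kind = raw_tf.dtype.kind     # "f" for float
```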

Fred Foo
  • 3
    I am not sure if there was a change in the API since the last post, but it's the `norm` parameter instead of `normalize` –  May 18 '15 at 16:43
  • @Fred Foo, really good explanation. I have just one question: TfidfVectorizer does not normalize the term-frequency vector when use_idf=True, right? In other words, it does not carry out two normalization steps (one for the tf and one for the tf-idf) but just one, for the tf-idf. – Economist_Ayahuasca May 02 '16 at 16:47
1

As larsmans said, TfidfVectorizer(use_idf=False, norm=None, ...) is supposed to behave the same as CountVectorizer.

In the current version of scikit-learn (0.14.1), there's a bug where TfidfVectorizer(binary=True, ...) silently leaves binary=False, which can throw you off during a grid search for the best parameters. (CountVectorizer, in contrast, sets the binary flag correctly.) This appears to be fixed in later (post-0.14.1) versions.
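
To see whether an installed version honors the flag, something like the following sketch may help. It assumes a recent scikit-learn with the `norm` keyword, and the one-document corpus is made up for illustration.

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

docs = ["foo foo foo bar"]  # "foo" occurs three times

# With binary=True every nonzero count should be clipped to 1.
cv = CountVectorizer(binary=True).fit_transform(docs).toarray()
tv = TfidfVectorizer(binary=True, use_idf=False, norm=None).fit_transform(docs).toarray()

# If the flag is honored, both maxima are 1; a 3 would mean binary was ignored.
cv_max = cv.max()
tv_max = tv.max()
```

A `tv_max` of 3 would indicate the behavior described above, where the binary setting is silently dropped.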