Add document to scikit-learn's CountVectorizer?

Question

I want to add a document to a pre-generated matrix using CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

word_counter = CountVectorizer()
words_matrix = word_counter.fit_transform(['first string', 'second string'])

Now I want to add another string, 'third string', to words_matrix by extending the matrix, something like this:

words_matrix += word_counter.fit_transform(['third string'])

But I can't get it to work without calling fit_transform on all the documents together.

What do you mean by extending? Do you want to update the vocabulary learned by the TfidfVectorizer? — Vivek Kumar, Apr 21 '17 at 14:56
I want to: fit_transform(['first string', 'second string']) & fit_transform(['third string']), so that the result is equivalent to: fit_transform(['first string', 'second string', 'third string']) — Alexander Hades, Apr 21 '17 at 15:03
Yeah, that's called incremental learning, and it's not possible with TfidfVectorizer. TfidfVectorizer is meant for smaller data that fits in memory at once. You should look into [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) — Vivek Kumar, Apr 21 '17 at 16:38
[Here is a very cool example of Out of Core Classification](http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html) — MaxU - stand with Ukraine, Apr 22 '17 at 09:50
For those whose needs HashingVectorizer doesn't meet, see a possible alternative in my answer to this related question [here](https://stackoverflow.com/questions/25154231/updating-the-feature-names-into-scikit-tfidfvectorizer/47639930#47639930). It's basically a custom implementation of partial fitting (or incremental fitting) for TfidfVectorizer and CountVectorizer. — Ido S, Dec 04 '17 at 18:50
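
A minimal sketch of the HashingVectorizer route suggested in the comments, for anyone who wants to see what "adding a document" could look like in practice. Because the hasher is stateless, the new document can be transformed separately and stacked under the existing rows. The settings below are assumptions made for this illustration and do not come from the thread: `n_features=2**18` is an arbitrary size, `norm=None` and `alternate_sign=False` are chosen so the entries are raw counts similar to CountVectorizer output, and `scipy.sparse.vstack` is used to append the row.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Stateless vectorizer: no vocabulary is learned, so there is nothing to refit.
# norm=None and alternate_sign=False (illustrative choices) keep the entries
# as raw, non-negative counts.
hasher = HashingVectorizer(n_features=2**18, norm=None, alternate_sign=False)

# Matrix for the documents available so far.
words_matrix = hasher.transform(['first string', 'second string'])

# Append the new document as an extra row; the columns keep their meaning
# because each token always hashes to the same column index.
words_matrix = vstack([words_matrix, hasher.transform(['third string'])]).tocsr()

# Compare against transforming all three documents in one call.
all_at_once = hasher.transform(['first string', 'second string', 'third string'])
same = (words_matrix != all_at_once).nnz == 0
print(words_matrix.shape, same)
```

The trade-off is that hashing is one-way: column indices cannot be mapped back to words, and distinct words may occasionally collide in the same column. If readable feature names are required, this approach probably won't fit.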

0 Answers