I want to add a document to a pre-generated matrix using CountVectorizer.

from sklearn.feature_extraction.text import CountVectorizer

word_counter = CountVectorizer()
words_matrix = word_counter.fit_transform(['first string', 'second string'])

Now I want to add another string 'third string' to words_matrix. Extending the matrix - something like this:

words_matrix += word_counter.fit_transform(['third string'])

But I can't get it to work without fit_transforming it all together.

  • What do you mean by extending? Do you want to update the vocabulary learned by the TfidfVectorizer? – Vivek Kumar Apr 21 '17 at 14:56
  • I want to: fit_transform(['first string', 'second string']) & fit_transform(['third string']), so that the result is equivalent to: fit_transform(['first string', 'second string', 'third string']) – Alexander Hades Apr 21 '17 at 15:03
  • Yeah, that's called incremental learning, and it is not possible with TfidfVectorizer. TfidfVectorizer is meant for smaller datasets that fit in memory at once. You should look into [`HashingVectorizer`](http://scikit-learn.org/stable/modules/generated/sklearn.feature_extraction.text.HashingVectorizer.html#sklearn.feature_extraction.text.HashingVectorizer) – Vivek Kumar Apr 21 '17 at 16:38
  • [Here is a very cool example of Out of Core Classification](http://scikit-learn.org/dev/auto_examples/applications/plot_out_of_core_classification.html) – MaxU - stand with Ukraine Apr 22 '17 at 09:50
  • For those whose needs HashingVectorizer doesn't meet, see a possible alternative in my answer to this related question [here](https://stackoverflow.com/questions/25154231/updating-the-feature-names-into-scikit-tfidfvectorizer/47639930#47639930). It's basically a custom implementation of partial fitting (or incremental fitting) for TfidfVectorizer and CountVectorizer. – Ido S Dec 04 '17 at 18:50
  • See https://stackoverflow.com/q/69156995/10495893 – Ben Reiniger Sep 13 '21 at 16:06
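
A minimal sketch of the HashingVectorizer suggestion from the comments, not a drop-in replacement for CountVectorizer. Because HashingVectorizer is stateless (it has no learned vocabulary), rows for new documents can be computed separately and stacked onto the existing matrix. The settings `n_features=2**10`, `alternate_sign=False` and `norm=None` are assumptions chosen here so the output holds raw, non-negative counts; the trade-off is that columns are hash buckets, so feature names cannot be recovered and distinct words may collide.

```python
from scipy.sparse import vstack
from sklearn.feature_extraction.text import HashingVectorizer

# Assumed settings: small feature space, raw counts, no sign flipping.
hasher = HashingVectorizer(n_features=2**10, alternate_sign=False, norm=None)

# Initial matrix; no fitting is needed because the vectorizer is stateless.
words_matrix = hasher.transform(['first string', 'second string'])

# Later: vectorize the new document alone and append its row.
new_rows = hasher.transform(['third string'])
words_matrix = vstack([words_matrix, new_rows]).tocsr()

# Compare with vectorizing all three documents in one call.
all_at_once = hasher.transform(['first string', 'second string', 'third string'])
same = (words_matrix != all_at_once).nnz == 0
```

Here `same` checks that the incrementally built matrix matches the one produced in a single call, which is the equivalence asked for in the question.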
