Scikit-learn tfidf vectorizer in minibatches?

Question

I've been trying to apply the tf-idf heuristic to a large corpus.

Can I read the documents iteratively and call vectorizer.fit() in each iteration? Does this take into account only the current iteration, or does it remember the previous ones?

Thanks!

Every time you call fit, the vocabulary will be initialized from scratch so that is not an option. — benbo, Jan 15 '19 at 12:56

benbo · Accepted Answer · 2019-01-15T14:52:06.000


The solution to your problem will depend on your particular application. You could consider gensim's tf-idf implementation, which is more efficient and does not need to keep the entire corpus in memory, as this post explains.
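To illustrate why a streaming approach can work at all, here is a minimal plain-Python sketch of the underlying idea: document frequencies are accumulated one batch at a time, and each document is weighted afterwards in a second pass. This is not gensim's or scikit-learn's code. The `iter_batches` reader and the toy documents are hypothetical, and the unsmoothed `log(N / df)` weighting is an assumption, so the numbers will differ from what either library produces.

```python
import math
from collections import Counter


def iter_batches():
    """Hypothetical stand-in for reading a large corpus in mini-batches."""
    yield [["the", "cat", "sat"], ["the", "dog", "barked"]]
    yield [["the", "cat", "purred"]]


# Pass 1: accumulate document frequencies one batch at a time.
# Only the counts are kept in memory, never the documents themselves.
doc_freq = Counter()
n_docs = 0
for batch in iter_batches():
    for tokens in batch:
        doc_freq.update(set(tokens))  # count each term once per document
        n_docs += 1


def tfidf(tokens):
    """Weight one document using the corpus-wide counts from pass 1."""
    counts = Counter(tokens)
    return {
        term: count * math.log(n_docs / doc_freq[term])  # assumed idf form
        for term, count in counts.items()
        if doc_freq[term] > 0  # skip terms never seen in pass 1
    }


# Pass 2: stream the corpus again and weight each document independently.
weights = [tfidf(tokens) for batch in iter_batches() for tokens in batch]
```

In gensim, the two roles above are played, as far as I know, by `gensim.corpora.Dictionary` (collecting the counts) and `gensim.models.TfidfModel` (applying the weights), so you would not normally write this by hand.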

edited Jan 15 '19 at 14:52

answered Jan 15 '19 at 12:57

benbo


Thanks! This is exactly what I was looking for. – sdgaw erzswer Jan 16 '19 at 08:29
