I have a large database, about 50 GB in size, consisting of excerpts from 486,000 dissertations across 780 specialties. For a research project I need to train a classifier on this data, but my resources are limited to a laptop-class (mobile) CPU and 16 GB of RAM (plus 16 GB of swap).
As a test, I ran the training on a subset of 40,000 items (roughly 10% of the database, about 4.5 GB) with SGDClassifier, and memory consumption already reached around 16-17 GB.
So I am asking the community for help. My current code looks roughly like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Bag-of-words counts -> TF-IDF weighting -> linear model trained with SGD
text_clf = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(n_jobs=8)),
])

# texts holds the excerpts, categories_ids the specialty labels
texts_train, texts_test, cat_train, cat_test = train_test_split(
    texts, categories_ids, test_size=0.2)
text_clf.fit(texts_train, cat_train)
How can I optimize this process so that I can train on the entire database with the resources I have?
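For reference, the direction I have been considering (but am not sure about) is scikit-learn's out-of-core pattern: a stateless HashingVectorizer combined with SGDClassifier.partial_fit, feeding the data in chunks so that only one chunk is in memory at a time. Below is a minimal sketch of what I mean; iter_batches is a hypothetical helper that would stream (texts, labels) chunks from my database, the n_features and batch_size values are just guesses, and the TF-IDF step is dropped because TfidfTransformer needs a full pass over the data.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no vocabulary is held in memory
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier(n_jobs=8)

all_classes = np.unique(categories_ids)  # the 780 specialty ids

# iter_batches() is a hypothetical generator that would stream
# (texts_chunk, labels_chunk) pairs from the database
for texts_chunk, labels_chunk in iter_batches(batch_size=5000):
    X = vectorizer.transform(texts_chunk)  # sparse matrix for this chunk only
    clf.partial_fit(X, labels_chunk, classes=all_classes)

Is something along these lines the right way to go, or is there a better approach for a dataset of this size?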