I have a large database, about 50 GB in size, consisting of excerpts from 486,000 dissertations across 780 specialties. For a research project I need to train a classifier on this data, but my resources are limited to a laptop-class (mobile) CPU and 16 GB of RAM (plus 16 GB of swap).
As a test, I ran the training on a subset of 40,000 items (roughly 10% of the database, about 4.5 GB) with SGDClassifier, and memory consumption already reached around 16-17 GB.
So I am asking the community for help. My current code looks roughly like this:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import train_test_split

# Bag-of-words counts -> TF-IDF weighting -> linear model trained with SGD
text_clf = Pipeline([
    ('count', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('clf', SGDClassifier(n_jobs=8)),
])

# texts holds the excerpts, categories_ids the specialty labels
texts_train, texts_test, cat_train, cat_test = train_test_split(
    texts, categories_ids, test_size=0.2)
text_clf.fit(texts_train, cat_train)
How can I optimize this process so that I can train on the entire database with the resources I have?
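For reference, the direction I have been considering (but am not sure about) is scikit-learn's out-of-core pattern: a stateless HashingVectorizer combined with SGDClassifier.partial_fit, feeding the data in chunks so that only one chunk is in memory at a time. Below is a minimal sketch of what I mean; iter_batches is a hypothetical helper that would stream (texts, labels) chunks from my database, the n_features and batch_size values are just guesses, and the TF-IDF step is dropped because TfidfTransformer needs a full pass over the data.

import numpy as np
from sklearn.feature_extraction.text import HashingVectorizer
from sklearn.linear_model import SGDClassifier

# Stateless vectorizer: no vocabulary is held in memory
vectorizer = HashingVectorizer(n_features=2**20, alternate_sign=False)
clf = SGDClassifier(n_jobs=8)

all_classes = np.unique(categories_ids)  # the 780 specialty ids

# iter_batches() is a hypothetical generator that would stream
# (texts_chunk, labels_chunk) pairs from the database
for texts_chunk, labels_chunk in iter_batches(batch_size=5000):
    X = vectorizer.transform(texts_chunk)  # sparse matrix for this chunk only
    clf.partial_fit(X, labels_chunk, classes=all_classes)

Is something along these lines the right way to go, or is there a better approach for a dataset of this size?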