I have a large corpus (~100 million documents, 59 GB) in a CSV file. I want to build a TF-IDF representation of it and do some feature engineering, but the data is too large to load into memory all at once (I'm working on Google Colab with a GPU and about 12 GB of RAM). I imagine there is a way to process the data in chunks and then combine the per-chunk TF-IDFs at the end, but I'm not sure how to proceed. Here's my code so far:
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer

# Read the compressed CSV lazily, 1,000,000 rows per chunk
chunks = pd.read_csv("data.csv.bz2",
                     chunksize=1000000,
                     nrows=120000000)

print(type(chunks))  # <class 'pandas.io.parsers.TextFileReader'>
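Each iteration of the reader yields an ordinary DataFrame; the chunk1 I use below comes from just pulling the first one off the reader, roughly:

# Grab the first chunk as a DataFrame for experimentation
chunk1 = next(chunks)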
Then I remove stopwords and punctuation, lemmatize (WordNetLemmatizer()), and stem (SnowballStemmer('english')). Roughly, the cleaning step looks like this (a simplified sketch; the clean_text helper and the exact order of operations are just for illustration):
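import string
import nltk
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer, SnowballStemmer

nltk.download("stopwords")
nltk.download("wordnet")

stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()
stemmer = SnowballStemmer("english")

def clean_text(text):
    # Lowercase, strip punctuation, drop stopwords, then lemmatize and stem
    text = text.lower().translate(str.maketrans("", "", string.punctuation))
    tokens = [t for t in text.split() if t not in stop_words]
    return " ".join(stemmer.stem(lemmatizer.lemmatize(t)) for t in tokens)

chunk1["comment"] = chunk1["comment"].astype(str).map(clean_text)

After cleaning, I vectorize a single chunk: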
# Fit a bag-of-words model on this chunk only...
count_vectorizer = CountVectorizer()
chunk1_counts = count_vectorizer.fit_transform(chunk1.comment)

# ...then convert the raw counts to TF-IDF weights
tfidf_transformer = TfidfTransformer()
chunk1_tfidf = tfidf_transformer.fit_transform(chunk1_counts)
I can read in a few chunks at a time, but to avoid memory errors I'll probably have to write each chunk's result to disk and delete the objects from memory before processing the next set of chunks. Something like this is what I have in mind (a sketch; I'm assuming scipy.sparse.save_npz is a reasonable way to persist the sparse matrices):
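import gc
import scipy.sparse

# Persist this chunk's sparse TF-IDF matrix, then free the memory it used
scipy.sparse.save_npz("chunk1_tfidf.npz", chunk1_tfidf)
del chunk1, chunk1_counts, chunk1_tfidf
gc.collect()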
At that point, what's the right process for combining the multiple per-chunk TF-IDF results into one? My concern is that each chunk's CountVectorizer is fitted separately, so the vocabularies (and therefore the matrix columns) won't line up across chunks.