I solve that problem using sklearn and pandas.
Iterate in your dataset once using pandas iterator and create a set of all words, after that use it in CountVectorizer vocabulary. With that the Count Vectorizer will generate a list of sparse matrix all of them with the same shape. Now is just use vstack to group them. The sparse matrix resulted have the same information (but the words in another order) as CountVectorizer object and fitted with all your data.
That solution is not the best if you consider the time complexity but is good for memory complexity. I use that in a dataset with 20GB +,
I wrote a python code (NOT THE COMPLETE SOLUTION) that show the properties, write a generator or use pandas chunks for iterate in your dataset.
from sklearn.feature_extraction.text import CountVectorizer
from scipy.sparse import vstack
# each string is a sample
text_test = [
'good people beauty wrong',
'wrong smile people wrong',
'idea beauty good good',
]
# scikit-learn basic usage
vectorizer = CountVectorizer()
result1 = vectorizer.fit_transform(text_test)
print(vectorizer.inverse_transform(result1))
print(f"First approach:\n {result1}")
# Another solution is
vocabulary = set()
for text in text_test:
for word in text.split():
vocabulary.add(word)
vectorizer = CountVectorizer(vocabulary=vocabulary)
outputs = []
for text in text_test: # use a generator
outputs.append(vectorizer.fit_transform([text]))
result2 = vstack(outputs)
print(vectorizer.inverse_transform(result2))
print(f"Second approach:\n {result2}")
Finally, use TfidfTransformer.