I have 9 GB of segmented documents on disk, but my VPS only has 4 GB of memory.
How can I vectorize the whole dataset without loading the entire corpus into memory at once? Is there any sample code?
My code is as follows:
from sklearn.feature_extraction.text import CountVectorizer

# Reading every document into memory at once is what overruns the 4 GB of RAM.
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)
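
For reference, here is the kind of thing I'm hoping for. This is only a sketch, assuming CountVectorizer's input='filename' mode makes fit() open and read each file itself, one document at a time, instead of needing the whole corpus up front (the stop_words list below is a placeholder for my real one):

from sklearn.feature_extraction.text import CountVectorizer
import os

corpus_dir = './seg_corpus'
paths = [os.path.join(corpus_dir, name) for name in os.listdir(corpus_dir)]

stop_words = []  # placeholder for my actual stop-word list

# With input='filename', fit() receives file paths and reads each file
# lazily, so only one document's text should be in memory at a time.
vectorizer = CountVectorizer(input='filename', stop_words=stop_words)
vectorizer.fit(paths)

Even with this, I assume the learned vocabulary itself still has to fit in memory. If that also turns out to be too large, would something like HashingVectorizer (which is stateless and keeps no vocabulary) be the right tool instead?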