
I have 9 GB of segmented documents on my disk and my VPS only has 4 GB of memory.

How can I vectorize the whole dataset without loading the entire corpus at initialization? Is there any sample code?

My code is as follows:

from sklearn.feature_extraction.text import CountVectorizer

# reads every document into memory up front, which is what exhausts the 4 GB
contents = [open('./seg_corpus/' + filename).read()
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words)
vectorizer.fit(contents)
Kalen Blue

1 Answer


Try this: instead of loading all the texts into memory, you can pass only file handles to the fit method, but you must specify input='file' in the CountVectorizer constructor.

from sklearn.feature_extraction.text import CountVectorizer

# pass open file handles instead of strings; with input='file' the
# vectorizer reads each file one at a time while building the vocabulary
contents = [open('./seg_corpus/' + filename)
            for filename in filenames]
vectorizer = CountVectorizer(stop_words=stop_words, input='file')
vectorizer.fit(contents)
Ibraim Ganiev
  • Thank you. Another question: are there any tricks for scikit-learn's KMeans like input='file'? I also cannot load the sparse matrix into memory. – Kalen Blue Oct 15 '16 at 16:09
  • @KalenBlue, it's quite strange that you cannot load the sparse matrix into memory. Is it really that big, or does some error happen when you try to use KMeans on it? Not being able to fit a sparse matrix in memory usually points to a programming mistake. Anyway, you can store the matrix in separate batches, load them one at a time, and use `MiniBatchKMeans` with its partial_fit method (see the MiniBatchKMeans sketch after these comments). Or, the easier route, compress the feature space so the sparse matrix fits in memory, for example by removing all features created by too-frequent or too-rare n-grams. – Ibraim Ganiev Oct 15 '16 at 16:54
  • You can also play with the max_features, max_df and min_df parameters to make the resulting CountVectorizer matrix smaller (see the pruning sketch below). – Ibraim Ganiev Oct 15 '16 at 17:06
  • Or consider using an alternative vectorizer that doesn't hold a vocabulary, like HashingVectorizer, I believe (see the HashingVectorizer sketch below). – rabbit Oct 15 '16 at 20:57
  • @IbraimGaniev I have vectorized the corpus to the filesystem, about 2 MB per vector without compression, and there are 8,000 documents, so my memory is not big enough. I am going to use an iterator to – Kalen Blue Oct 16 '16 at 07:55
  • @IbraimGaniev I am going to use an iterator to save memory, but I may have to implement k-means myself. Can scikit-learn's KMeans satisfy my requirement? – Kalen Blue Oct 16 '16 at 08:00
  • @NBartley Thanks. I have successfully vectorized the corpus by initializing the CountVectorizer with a fixed vocabulary. I will give HashingVectorizer a try. – Kalen Blue Oct 16 '16 at 08:09
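
The batching that Ibraim describes in the comments could look roughly like the sketch below. It is only a sketch, assuming the vocabulary has already been fitted (as Kalen mentions) and is available as vocabulary, and that stop_words and filenames are the same variables as in the question; batch_size and n_clusters are illustrative values, not recommendations.

from sklearn.feature_extraction.text import CountVectorizer
from sklearn.cluster import MiniBatchKMeans

# reuse the already-fitted vocabulary so transform() needs no fitting pass
vectorizer = CountVectorizer(stop_words=stop_words, input='file',
                             vocabulary=vocabulary)
kmeans = MiniBatchKMeans(n_clusters=20)  # assumed cluster count

batch_size = 500  # assumed; pick whatever fits in 4 GB
for start in range(0, len(filenames), batch_size):
    handles = [open('./seg_corpus/' + name)
               for name in filenames[start:start + batch_size]]
    X_batch = vectorizer.transform(handles)  # sparse matrix for this batch only
    kmeans.partial_fit(X_batch)              # update cluster centers incrementally
    for handle in handles:
        handle.close()

partial_fit keeps only the cluster centers between calls, so peak memory is bounded by a single batch's sparse matrix.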
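
The pruning parameters from the comments, in context; the thresholds here (0.5, 5, 50000) are illustrative assumptions and would need tuning against the actual corpus.

from sklearn.feature_extraction.text import CountVectorizer

# drop terms that appear in more than 50% of documents or in fewer than
# 5 documents, and keep at most the 50,000 most frequent remaining terms
vectorizer = CountVectorizer(stop_words=stop_words, input='file',
                             max_df=0.5, min_df=5, max_features=50000)
X = vectorizer.fit_transform(open('./seg_corpus/' + name) for name in filenames)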
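
And a minimal sketch of rabbit's HashingVectorizer suggestion: it hashes terms into a fixed-size feature space instead of building a vocabulary, so nothing about the corpus has to be held in memory between documents. n_features=2 ** 18 is an illustrative assumption.

from sklearn.feature_extraction.text import HashingVectorizer

hasher = HashingVectorizer(stop_words=stop_words, input='file',
                           n_features=2 ** 18)
# HashingVectorizer is stateless, so no fit pass over the corpus is needed
X = hasher.transform(open('./seg_corpus/' + name) for name in filenames)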