How can I reduce memory usage of Scikit-Learn Vectorizers?

Question

TfidfVectorizer takes a lot of memory: vectorizing 100k documents (470 MB) takes over 6 GB. If we go to 21 million documents, it will not fit in the 60 GB of RAM we have.

So we went for HashingVectorizer, but we still need to know how to distribute the hashing vectorizer. fit and partial_fit do nothing, so how do we work with a huge corpus?

score 10 · Accepted Answer · answered Jul 08 '13 at 21:57

I would strongly recommend using the HashingVectorizer when fitting models on large datasets.

The HashingVectorizer is data-independent: only the parameters from vectorizer.get_params() matter. Hence (un)pickling a `HashingVectorizer` instance should be very fast.

The vocabulary-based vectorizers are better suited for exploratory analysis on small datasets.
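
To illustrate the data-independence point, here is a minimal sketch. The toy documents, the `iter_batches` helper, and the `n_features` value are assumptions made for illustration, not part of the original answer. It shows that `transform` can be called batch by batch without any prior `fit`, and that a pickled copy of the vectorizer produces the same features.

```python
import pickle

from sklearn.feature_extraction.text import HashingVectorizer

# n_features=2**18 is an assumed setting; more buckets mean fewer hash collisions.
vectorizer = HashingVectorizer(n_features=2**18)

# Toy stand-in for a corpus that would normally be streamed from disk.
docs = [
    "the quick brown fox",
    "jumps over the lazy dog",
    "hashing needs no vocabulary",
    "so memory use stays flat",
]


def iter_batches(items, batch_size):
    """Yield consecutive slices of `items` with at most `batch_size` elements."""
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]


# No fit() call is needed: transform() depends only on the constructor parameters.
batches = [vectorizer.transform(batch) for batch in iter_batches(docs, 2)]

# A pickled copy (e.g. one sent to a worker process) hashes text to the same columns.
restored = pickle.loads(pickle.dumps(vectorizer))
difference = abs(restored.transform(docs) - vectorizer.transform(docs)).sum()
```

Because nothing is learned from the data, each worker in a distributed setup could build or unpickle its own vectorizer and still produce compatible feature matrices.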

ogrisel

Very good, I found this: http://t.co/12cFDYlTil and am testing it. Can we use unsupervised learning (KMeans)? – Phyo Arkar Lwin Jul 09 '13 at 18:31
With TfidfVectorizer we can use RandomizedPCA for plotting, but the HashingVectorizer output is different, right? How can we do a scatter plot of that? – Phyo Arkar Lwin Jul 09 '13 at 18:32
Why would it be different? RandomizedPCA can take any sparse matrix as input, whatever way it was generated. – ogrisel Jul 10 '13 at 08:31
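
A sketch of the 2-D projection idea follows. Note the swap: `RandomizedPCA` was removed from later scikit-learn releases, so this example uses `TruncatedSVD` with the randomized solver instead, which accepts sparse input directly but, unlike PCA, does not center the data. The toy documents and parameter values are assumptions.

```python
from sklearn.decomposition import TruncatedSVD
from sklearn.feature_extraction.text import HashingVectorizer

# Toy documents, invented for illustration.
docs = [
    "cats and dogs are pets",
    "dogs chase cats",
    "pets need food and water",
    "stocks and bonds are assets",
    "bonds pay interest",
    "assets need careful management",
]

# Sparse hashed features; n_features=2**12 is an assumed setting.
X = HashingVectorizer(n_features=2**12).transform(docs)

# Randomized truncated SVD works on the sparse matrix without densifying it.
svd = TruncatedSVD(n_components=2, algorithm="randomized", random_state=0)
coords = svd.fit_transform(X)

# coords[:, 0] and coords[:, 1] can be handed to any scatter plot routine.
```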
If you want to do out-of-core unsupervised learning (clustering) you should use MiniBatchKMeans instead of KMeans. Only the former has a `partial_fit` method. – ogrisel Jul 10 '13 at 08:32
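
A minimal sketch of that out-of-core clustering loop, assuming toy batches and arbitrary parameter values (none of them come from the thread): each batch is hashed and passed to `MiniBatchKMeans.partial_fit`, so only one batch is in memory at a time.

```python
from sklearn.cluster import MiniBatchKMeans
from sklearn.feature_extraction.text import HashingVectorizer

# Assumed settings for illustration only.
vectorizer = HashingVectorizer(n_features=2**12)
kmeans = MiniBatchKMeans(n_clusters=2, n_init=3, random_state=0)

# Toy batches; in practice each batch would be read from disk or a database.
batches = [
    ["cats and dogs are pets", "dogs chase cats", "stocks and bonds are assets"],
    ["pets need food and water", "bonds pay interest", "assets need management"],
]

for batch in batches:
    # Each call updates the cluster centers using only the current batch.
    kmeans.partial_fit(vectorizer.transform(batch))

# New text is hashed the same way before asking for its cluster.
labels = kmeans.predict(vectorizer.transform(["my dogs and cats"]))
```

Each batch should contain at least `n_clusters` documents so that the first call can initialize the centers.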
We managed to get the list of HashingVectorizers to work with MiniBatchKMeans (MBK), but when visualized, the resulting clusters are about 30% different from the TfidfVectorizer ones. – Phyo Arkar Lwin Jul 10 '13 at 18:13
`HashingVectorizer` does not do IDF weighting. That might be the cause of your problem. You could try to pipeline a `TfidfTransformer` to do the IDF re-weighting on the output of the `HashingVectorizer` manually. – ogrisel Jul 10 '13 at 19:25
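
One way this pipelining might look, as a sketch under assumptions: the toy documents and the choice of `norm=None` and `alternate_sign=False` (so the transformer sees raw non-negative counts) are illustrative, not taken from the thread. Note that `TfidfTransformer` is stateful, so unlike the hashing step it has to be fitted on the corpus, or on a representative sample of it, to learn the IDF weights.

```python
from sklearn.feature_extraction.text import HashingVectorizer, TfidfTransformer
from sklearn.pipeline import make_pipeline

# Hashing step emits raw counts; the transformer then applies IDF and L2 normalization.
hashing_tfidf = make_pipeline(
    HashingVectorizer(n_features=2**12, norm=None, alternate_sign=False),
    TfidfTransformer(),
)

# Toy documents, invented for illustration.
docs = [
    "the cat sat on the mat",
    "the dog sat on the log",
    "hashing needs no vocabulary",
]

# fit_transform learns one IDF weight per hash bucket, then reweights the counts.
X = hashing_tfidf.fit_transform(docs)
idf = hashing_tfidf.named_steps["tfidftransformer"].idf_
```

The learned state is a single vector of `n_features` floats, so the memory cost stays far below that of a full vocabulary.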
Thanks, we are testing with TfidfTransformer. I sent you an email with links to several visualization screenshots. – Phyo Arkar Lwin Jul 10 '13 at 20:41

score 1 · Answer 2 · answered Feb 10 '15 at 04:49

One way to overcome the inability of HashingVectorizer to account for IDF is to index your data into Elasticsearch or Lucene and retrieve term vectors from there, from which you can calculate TF-IDF.
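
As a rough sketch of the calculation step only: the dictionary below is a hand-written stand-in whose shape is assumed to resemble a term vectors response with term statistics enabled. The sample numbers, the `tfidf_from_termvectors` helper, and the smoothed IDF formula are all assumptions chosen for illustration, and no Elasticsearch or Lucene call is made.

```python
import math


def tfidf_from_termvectors(field_vectors):
    """Compute a TF-IDF weight per term from per-document term statistics.

    Hypothetical helper: uses a smoothed IDF, ln((1 + N) / (1 + df)) + 1,
    which is one common variant among several.
    """
    doc_count = field_vectors["field_statistics"]["doc_count"]
    weights = {}
    for term, stats in field_vectors["terms"].items():
        idf = math.log((1 + doc_count) / (1 + stats["doc_freq"])) + 1
        weights[term] = stats["term_freq"] * idf
    return weights


# Invented sample: 4 documents in the index, statistics for one document's field.
sample = {
    "field_statistics": {"sum_doc_freq": 12, "doc_count": 4, "sum_ttf": 15},
    "terms": {
        "hashing": {"doc_freq": 1, "ttf": 2, "term_freq": 2},
        "the": {"doc_freq": 4, "ttf": 6, "term_freq": 1},
    },
}

weights = tfidf_from_termvectors(sample)
```

In this sample the rare term "hashing" ends up with a larger weight than "the", which occurs in every document.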

Gireesh Ramji
