
I'm running a text machine learning algorithm that generates n-grams. This, of course, massively balloons the size of the input. To put this in context: the original input is ~30K lines in a file, and after generating trigrams I have 348,000 entries.

I'm using scikit-learn's TfidfVectorizer, and if I feed it too many values, the numpy arrays inside it raise a MemoryError. I'm only able to use ~27,500 trigrams before I hit the limit, which means I can use at most 10% of the available data.

What can I do to help remedy this problem? Do I have any options?

    I believe this is really an operating system question, as it is via the OS that one can allocate a certain amount of disk space as paging or virtual memory. But it will slow things down *incredibly*. It's probably better to either find a machine with more memory, or to rethink your approach altogether. – jme Nov 27 '15 at 00:48
  • See [this question](http://stackoverflow.com/questions/5537618/memory-errors-and-list-limits-in-python) for some information. Make sure you're using 64-bit Python and not 32-bit. – bbayles Nov 27 '15 at 00:58

1 Answer

As @jme mentioned, Python itself has no control over the OS's memory management. If you cannot find a machine with more RAM, probably the most reasonable approach is to limit the number of features, e.g. with one of the following TfidfVectorizer parameters (see the sketch after the list):

max_df : float in range [0.0, 1.0] or int, default=1.0

When building the vocabulary, ignore terms that have a document frequency strictly higher than the given threshold (corpus-specific stop words). If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

min_df : float in range [0.0, 1.0] or int, default=1

When building the vocabulary, ignore terms that have a document frequency strictly lower than the given threshold. This value is also called cut-off in the literature. If float, the parameter represents a proportion of documents; if integer, absolute counts. This parameter is ignored if vocabulary is not None.

max_features : int or None, default=None

If not None, build a vocabulary that only consider the top max_features ordered by term frequency across the corpus.

This parameter is ignored if vocabulary is not None.
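A minimal sketch of how these parameters might be combined (the corpus, `ngram_range`, and the specific threshold values are placeholders, not taken from the question):

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["an example document", "another example document", "..."]  # placeholder corpus

    vectorizer = TfidfVectorizer(
        ngram_range=(1, 3),   # include trigrams, as in the question
        max_df=0.95,          # drop terms that appear in more than 95% of documents
        min_df=2,             # drop terms that appear in fewer than 2 documents
        max_features=50000,   # keep only the 50,000 most frequent terms
    )

    X = vectorizer.fit_transform(docs)  # X is a scipy.sparse matrix, not a dense array
    print(X.shape, type(X))

Any one of these on its own can cap the vocabulary; `max_features` gives the most direct control over memory, while `min_df`/`max_df` prune terms that carry little signal anyway.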

  • I actually found my problem. It turns out that the vectorizer is very efficient, but if you call toarray() on the vector it kills the memory. Once I realized that, things became better. – Dylan Lawrence Nov 27 '15 at 20:19
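A hedged illustration of the point in that comment (the corpus is a placeholder): the vectorizer returns a sparse matrix that stores only non-zero entries, and calling `.toarray()` on it materializes a dense copy whose size is `n_docs * n_features`, which is what blows up memory with hundreds of thousands of features.

    from sklearn.feature_extraction.text import TfidfVectorizer

    docs = ["some text", "more text", "..."]  # placeholder corpus
    X = TfidfVectorizer(ngram_range=(1, 3)).fit_transform(docs)

    print(type(X))            # scipy.sparse matrix -- compact, safe to keep
    # X_dense = X.toarray()   # dense ndarray of shape (n_docs, n_features);
    #                         # avoid this and pass the sparse matrix directly
    #                         # to estimators that accept sparse input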