
I have a small text file (19250 words / 7433 unique tokens) and want to build a word matrix.

I tried the same code on Windows (16 GB RAM) and macOS (16 GB RAM). On the Mac the code runs smoothly, but on Windows I keep getting MemoryError messages even though physical memory always has 11+ GB free. I monitored memory use while the script was running and it stayed well below the limit.

Here is my code:

import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer
corpus = open("cleancorpus.txt", "r", encoding="utf8")

def bow_extractor(corpus, ngram_range=(1, 1)):
    # All CountVectorizer defaults spelled out explicitly; ngram_range is
    # passed through instead of being hard-coded to (1, 1).
    vectorizer = CountVectorizer(analyzer='word', binary=False, decode_error='strict',
        encoding='utf-8', input='content',
        lowercase=True, max_df=1.0, max_features=None, min_df=1,
        ngram_range=ngram_range, preprocessor=None, stop_words=None,
        strip_accents=None, token_pattern=r'(?u)\b\w\w+\b',
        tokenizer=None, vocabulary=None)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features
bow_vectorizer, bow_features = bow_extractor(corpus)
features = bow_features.todense()
# This is where I get the error message.

The error message is as follows:

---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-107-60fc9be6d9fa> in <module>()
----> 1 features = bow_features.todense()

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\base.py in todense(self, order, out)
    790             `numpy.matrix` object that shares the same memory.
    791         """
--> 792         return np.asmatrix(self.toarray(order=order, out=out))
    793 
    794     def toarray(self, order=None, out=None):

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
    941         if out is None and order is None:
    942             order = self._swap('cf')[0]
--> 943         out = self._process_toarray_args(order, out)
    944         if not (out.flags.c_contiguous or out.flags.f_contiguous):
    945             raise ValueError('Output array must be C or F contiguous')

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
   1128             return out
   1129         else:
-> 1130             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1131
   1132     def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):

MemoryError:

It seems to be somehow related to running the code on Windows. The library versions are the same on Mac and Windows:

  • scikit-learn 0.19.1
  • pandas 0.22.0
  • scipy 1.0.0
  • numpy 1.14.2
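In case it helps, here is a minimal sketch of the same pipeline on a tiny in-memory toy corpus (the document texts are made up for illustration; the real code passes an open file object), together with an estimate of what the dense copy would need:

```python
from sklearn.feature_extraction.text import CountVectorizer

# Tiny stand-in for the corpus file (assumed content, for illustration).
docs = ["the quick brown fox", "the lazy dog", "the quick dog"]

vectorizer = CountVectorizer()
features = vectorizer.fit_transform(docs)  # scipy.sparse CSR matrix

# Bytes a dense copy would need: rows * cols * bytes-per-element.
rows, cols = features.shape
dense_bytes = rows * cols * features.dtype.itemsize
print(rows, cols, dense_bytes)  # → 3 6 144
```

Only `todense()` materializes that full rows × cols array; the sparse matrix itself stays small.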

A month ago I posted a similar problem, "spacy MemoryError for small file" (still unresolved).

Any hint would be welcome!

FMassion
  • *c:\program files **(x86)** \python36-32\....* I suspect you are using 32-bit Python on Windows, which has only between 2 and 4 GB of address space ([ref](https://stackoverflow.com/q/639540/3005167)). – MB-F May 22 '18 at 13:45
  • Yes indeed, this was the reason. I replaced the 32-bit version with a 64-bit version and that solved the problem. Thank you! I had a 7500 × 7500 matrix and that was too much for the address space. – FMassion May 23 '18 at 15:15
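For reference, a short sketch (using the 7433-token vocabulary mentioned in the question; the square shape is an assumption for illustration) that checks the interpreter bitness and the dense footprint such a matrix implies:

```python
import struct

# Pointer size in bits: 32 on a 32-bit Python build, 64 on a 64-bit one.
bits = struct.calcsize("P") * 8
print(f"{bits}-bit Python")

# Dense footprint of a 7433 x 7433 int64 matrix, in MiB. Roughly 420 MiB,
# which fits easily in 16 GB of RAM but can fail to allocate as a single
# contiguous block inside a 32-bit process's 2-4 GB address space.
matrix_mib = 7433 * 7433 * 8 / 1024 ** 2
print(f"{matrix_mib:.0f} MiB")
```

This is why the script failed on Windows despite ample free physical memory: the limit was the 32-bit process address space, not RAM.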

0 Answers