I have a small text file (19250 words / 7433 unique tokens) and want to build a word matrix from it.
I tried the same code on Windows (16 GB RAM) and macOS (16 GB). On the Mac the code runs smoothly, but on Windows I keep getting MemoryError messages even though physical memory always shows 11+ GB free. I monitored memory usage while the script was running and it stayed well below the limit.
Here is my code:
import pandas as pd
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.feature_extraction.text import CountVectorizer

corpus = open("cleancorpus.txt", "r", encoding="utf8")

def bow_extractor(corpus, ngram_range=(1, 1)):
    # Build a bag-of-words (document-term) matrix; each line of the file is one document.
    vectorizer = CountVectorizer(analyzer=u'word', binary=False, decode_error=u'strict',
                                 encoding=u'utf-8', input=u'content',
                                 lowercase=True, max_df=1.0, max_features=None, min_df=1,
                                 ngram_range=ngram_range, preprocessor=None, stop_words=None,
                                 strip_accents=None, token_pattern=u'(?u)\\b\\w\\w+\\b',
                                 tokenizer=None, vocabulary=None)
    features = vectorizer.fit_transform(corpus)
    return vectorizer, features

bow_vectorizer, bow_features = bow_extractor(corpus)
features = bow_features.todense()  # This is where I get the error message.
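For what it's worth, the sparse matrix itself is built without problems; the error only appears when it is materialized as a dense array. A minimal sketch I can run before the .todense() call to see how big the dense matrix would be (it only uses the shape, dtype and nnz attributes of the scipy sparse matrix returned above, nothing new):

# Estimate the memory the dense version of bow_features would need.
n_docs, n_terms = bow_features.shape
dense_bytes = n_docs * n_terms * bow_features.dtype.itemsize
print("sparse shape:", bow_features.shape)
print("stored non-zeros:", bow_features.nnz)
print("dense size estimate: %.1f MB" % (dense_bytes / 1024 ** 2))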
The error message is as follows:
---------------------------------------------------------------------------
MemoryError                               Traceback (most recent call last)
<ipython-input-107-60fc9be6d9fa> in <module>()
----> 1 features = bow_features.todense()

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\base.py in todense(self, order, out)
    790         `numpy.matrix` object that shares the same memory.
    791         """
--> 792         return np.asmatrix(self.toarray(order=order, out=out))
    793
    794     def toarray(self, order=None, out=None):

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\compressed.py in toarray(self, order, out)
    941         if out is None and order is None:
    942             order = self._swap('cf')[0]
--> 943         out = self._process_toarray_args(order, out)
    944         if not (out.flags.c_contiguous or out.flags.f_contiguous):
    945             raise ValueError('Output array must be C or F contiguous')

c:\program files (x86)\python36-32\lib\site-packages\scipy\sparse\base.py in _process_toarray_args(self, order, out)
   1128             return out
   1129         else:
-> 1130             return np.zeros(self.shape, dtype=self.dtype, order=order)
   1131
   1132     def __numpy_ufunc__(self, func, method, pos, inputs, **kwargs):

MemoryError:
It seems to be somehow related to running the code on Windows. I have the same library versions on Mac and Windows:
scikit-learn 0.19.1, pandas 0.22.0, scipy 1.0.0, numpy 1.14.2
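One environment difference I could still check is whether the two interpreters are the same bitness (the Windows install in the traceback lives under "python36-32"). A quick check, using only standard-library calls, would be:

import platform, struct, sys
print(platform.architecture())   # e.g. ('32bit', 'WindowsPE') vs ('64bit', ...)
print(struct.calcsize("P") * 8)  # pointer size in bits: 32 or 64
print(sys.maxsize > 2**32)       # True on a 64-bit build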
A month ago I posted a similar problem, "spacy MemoryError for small file", which is still unresolved.
Any hint would be welcome!