I am trying to train a MultinomialNB classifier on a huge data set (features as well as targets, about 75k x 130k). I am aware that this classifier generates a distinct model for each target class, so I expect memory usage to explode.
However, the process won't allocate more than about 20GB of RAM even though the machine has about 640GB.
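For context, this is roughly what the training call looks like (the names train_mb and mb_classifier match the traceback below; the data here is randomly generated stand-in data, and the number of distinct labels is a guess, since I only know the overall dimensions):

import numpy as np
from scipy import sparse
from sklearn.naive_bayes import MultinomialNB

def train_mb():
    # stand-in for the real data: a sparse count matrix of roughly
    # 75k samples x 130k features, plus one label per sample
    rng = np.random.RandomState(0)
    X = sparse.random(75000, 130000, density=1e-4, format="csr",
                      random_state=rng,
                      data_rvs=lambda n: rng.randint(1, 10, n))
    y = rng.randint(0, 50000, size=X.shape[0])  # label count is an assumption

    mb_classifier = MultinomialNB()
    # first (and only) partial_fit call; this is where the MemoryError occurs
    mb_classifier.partial_fit(X, y, classes=list(set(y)))

if __name__ == "__main__":
    train_mb()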
I have tried to raise the memory-lock limit and to run the script as root (which I have to do to adjust these limits), but it doesn't help. This is the traceback I get:
Traceback (most recent call last):
  File "test_classifiers.py", line 202, in <module>
    train_mb()
  File "test_classifiers.py", line 168, in train_mb
    mb_classifier.partial_fit(X, y, list(set(y)))
  File "/usr/local/lib/python3.5/dist-packages/sklearn/naive_bayes.py", line 539, in partial_fit
    Y = label_binarize(y, classes=self.classes_)
  File "/usr/local/lib/python3.5/dist-packages/sklearn/preprocessing/label.py", line 657, in label_binarize
    Y = Y.toarray()
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/compressed.py", line 1024, in toarray
    out = self._process_toarray_args(order, out)
  File "/usr/local/lib/python3.5/dist-packages/scipy/sparse/base.py", line 1186, in _process_toarray_args
    return np.zeros(self.shape, dtype=self.dtype, order=order)
MemoryError
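The failing line, np.zeros(self.shape, ...), is allocating a dense version of the binarized label matrix, which has shape (n_samples, n_classes). A rough estimate of that allocation (the class count here is an assumption on my part):

n_samples = 75000        # samples in my data
n_classes = 130000       # assumed upper bound on the number of distinct labels
bytes_per_entry = 8      # np.zeros with a 64-bit dtype

size_gib = n_samples * n_classes * bytes_per_entry / 1024.0 ** 3
print("dense label matrix would need about %.0f GiB" % size_gib)  # ~73 GiB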
These are the two calls I used to lift the limit:

resource.setrlimit(resource.RLIMIT_MEMLOCK, (-1, -1))

and

resource.setrlimit(resource.RLIMIT_MEMLOCK, (resource.RLIM_INFINITY, resource.RLIM_INFINITY))

Both have been tried, without success. Any ideas? Could this be related to the fact that only one CPU can be used with this classifier?
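In case it matters, this is how I inspect the limits the process actually runs with; RLIMIT_AS and RLIMIT_DATA are only my guess at the limits that could govern ordinary heap allocations, whereas RLIMIT_MEMLOCK (the one set above) restricts locked pages:

import resource

# print soft/hard limits for the limits that might be relevant here;
# -1 corresponds to resource.RLIM_INFINITY (no limit)
for name in ("RLIMIT_MEMLOCK", "RLIMIT_AS", "RLIMIT_DATA"):
    soft, hard = resource.getrlimit(getattr(resource, name))
    print("%s: soft=%s hard=%s" % (name, soft, hard))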