
I am running the following code to create and fit a GaussianNB classifier:

features_train, features_test, labels_train, labels_test = preprocess()

### compute the accuracy of your Naive Bayes classifier
# import the sklearn module for GaussianNB 
from sklearn.naive_bayes import GaussianNB 
import numpy as np

### create classifier 
clf = GaussianNB()

### fit the classifier on the training features and labels    
clf.fit(features_train, labels_train)

Running the above locally:

>>> runfile('C:/.../naive_bayes')
no. of Chris training emails: 4406
no. of Sara training emails: 4383
>>> clf
GaussianNB()

I believe this confirms that `preprocess()` works, because it loads features_train, features_test, labels_train, and labels_test successfully.
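For context, the accuracy step I was about to add looks roughly like this. It is a minimal self-contained sketch: the random arrays below stand in for whatever `preprocess()` actually returns.

```python
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score
import numpy as np

# synthetic stand-ins for the real output of preprocess()
rng = np.random.RandomState(0)
features_train = rng.rand(100, 5)
labels_train = rng.randint(0, 2, 100)
features_test = rng.rand(20, 5)
labels_test = rng.randint(0, 2, 20)

# fit the classifier, predict on the test set, and score the predictions
clf = GaussianNB()
clf.fit(features_train, labels_train)
pred = clf.predict(features_test)
acc = accuracy_score(labels_test, pred)
print(acc)
```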

When I try to clf.score or clf.predict, I get a MemoryError:

>>> clf.predict(features_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 343, in _joint_log_likelihood
    n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
MemoryError
>>> clf.score(features_test,labels_test)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "C:\Python27\lib\site-packages\sklearn\base.py", line 295, in score
    return accuracy_score(y, self.predict(X), sample_weight=sample_weight)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 64, in predict
    jll = self._joint_log_likelihood(X)
  File "C:\Python27\lib\site-packages\sklearn\naive_bayes.py", line 343, in _joint_log_likelihood
    n_ij -= 0.5 * np.sum(((X - self.theta_[i, :]) ** 2) /
MemoryError

I do not think the problem is my machine's memory: I do not see a spike in RAM in Task Manager, and usage stays nowhere near the machine's total.

I suspect it is something to do with the Python version or the library versions.

Any help diagnosing this is appreciated. I can provide more info as needed.
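One check worth doing: whether the interpreter itself is 32-bit, since a 32-bit process is capped at roughly 2 GB of address space regardless of how much RAM is installed, which can raise `MemoryError` long before Task Manager shows pressure. A minimal sketch; the sample and feature counts below are hypothetical placeholders, not the real dataset sizes:

```python
import struct

# pointer size in bits: 32 on a 32-bit interpreter, 64 on a 64-bit one
bits = struct.calcsize("P") * 8
print("Python is %d-bit" % bits)

# rough estimate of one temporary array that broadcasting in
# _joint_log_likelihood creates: (X - theta_) is the same shape
# as X, in float64 (8 bytes per element)
n_samples, n_features = 1700, 3785  # placeholders; substitute your own
temp_bytes = n_samples * n_features * 8
print("one temporary array: ~%.1f MB" % (temp_bytes / 1e6))
```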

ximiki
  • It sounds like you are burning through the heap, so you may wish to do something similar to [this](http://stackoverflow.com/a/1681971/1771644) – Michal Frystacky Feb 11 '16 at 18:00
  • @MichalFrystacky so off the top of your head, you don't think it is related to any version-updating, etc.? – ximiki Feb 11 '16 at 18:07
  • It's possible, but the evidence seems to point to using too much data. You might have a problem similar to [this](http://stackoverflow.com/questions/4285185/python-memory-limit) – Michal Frystacky Feb 11 '16 at 18:16
  • @ximiki, is `4406 + 4383` the total size of the data you are using? If so, I don't see how memory could be a problem, even though I don't know anything about the emails, nor how you encode them. Do you use a BoW approach? Maybe you could give us some insight into what you are doing in the `preprocess` function. I know you say that you pass it successfully, but maybe there is something in the way you preprocess it that trips up the `GaussianNB` – Guiem Bosch Feb 11 '16 at 19:31
  • and if you are worried about `sklearn` version you could test for example the newest in a `virtualenv` – Guiem Bosch Feb 11 '16 at 19:33
  • I think I answered my own question, but want to check with you before I post it. The simple solution was just to run my code in 64-bit Anaconda Python: all the problems went away and the analysis finished in ~30 sec. I am not sure this is worthy of an "Answer", but it helped me - perhaps I should post it in case someone searches for a similar problem in the future? – ximiki Feb 11 '16 at 20:36

2 Answers


I believe I answered my own question after reading some related posts online (I did not use previously answered Stack Overflow posts).

The key for me was simply to move to 64-bit Python via Anaconda. All 'MemoryError' issues were resolved when the exact same code that failed in 32-bit Python was rerun in 64-bit. To the best of my understanding, this was the only variable that changed.

Perhaps this is not a very satisfying answer, but it would be nice if this question could remain for others searching for the same sklearn MemoryError problem in the future.
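For anyone who cannot move off 32-bit Python: since `GaussianNB.predict` scores each row independently, predicting in small batches should keep the temporary `(X - theta_)` array small. This is a hedged sketch I have not benchmarked; `predict_in_chunks` is a hypothetical helper, not part of sklearn:

```python
import numpy as np
from sklearn.naive_bayes import GaussianNB

def predict_in_chunks(clf, X, chunk_size=500):
    """Predict in slices so each broadcast temporary stays small."""
    parts = [clf.predict(X[i:i + chunk_size])
             for i in range(0, len(X), chunk_size)]
    return np.concatenate(parts)

# demo with synthetic data
rng = np.random.RandomState(0)
X_train = rng.rand(200, 10)
y_train = rng.randint(0, 2, 200)
X_test = rng.rand(1000, 10)

clf = GaussianNB().fit(X_train, y_train)
pred = predict_in_chunks(clf, X_test)
print(pred.shape)
```

Because rows are scored independently, the chunked result should match an all-at-once `clf.predict(X_test)` exactly; only the peak memory differs.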

ximiki

I'm also taking that same Udacity course and I had the exact same problem. I installed 64-bit Anaconda and executed the script inside Spyder, and everything worked as expected.

Danilo Souza Morães