
I'm using the NLTK NaiveBayesClassifier for sentiment analysis. The whole thing is incredibly slow. I've even tried saving my trained classifier so I don't have to retrain it each time, but I notice no difference in speed/time.

To save:

import cPickle
f = open('my_classifier.pickle', 'wb')
cPickle.dump(classifier, f)
f.close()

To load later:

import cPickle
f = open('my_classifier.pickle', 'rb')
classifier = cPickle.load(f)
f.close()

What else can I do to improve the speed? It takes 6 seconds to analyse a sentence; I would like <1 second (I'm running this on a website).

*Now I've changed to saving/loading with cPickle instead of pickle and the time has dropped to 3 seconds!

user3912889

3 Answers


NLTK is a teaching toolkit; it's not really optimized for speed. If you want a fast naive Bayes classifier, use the one from scikit-learn. There's a wrapper for this in NLTK (although straight scikit-learn will still be faster).

Furthermore, scikit-learn models can be loaded quickly if you use memory mapping. First, train the model and store it with

# Let "clf" be your classifier, usually a Pipeline of CountVectorizer
# and MultinomialNB
from sklearn.externals import joblib
joblib.dump(clf, SOME_PATH, compress=0)  # turn off compression

and load it with

clf = joblib.load(SOME_PATH, mmap_mode='r')

This also allows sharing the model between worker processes cheaply.

If it's still too slow, then make sure you process batches of documents instead of one at a time. That can be orders of magnitude faster.
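As a rough illustration of the batching point (the training strings, labels and documents below are made up for the sketch, and it assumes scikit-learn is installed), a pipeline scores a whole batch of documents in a single predict() call:

```python
# Illustrative sketch: batch classification with a scikit-learn pipeline.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy two-class training set, purely for demonstration.
clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(["good great excellent", "bad poor awful"], ["pos", "neg"])

docs = ["a good day", "a poor effort", "great stuff"]
# One vectorized predict() over the whole batch, not a Python loop
# calling the classifier once per document.
labels = clf.predict(docs)  # ['pos', 'neg', 'pos']
```

The vectorizer and the naive Bayes prediction are both implemented as matrix operations, so the per-document overhead almost vanishes when you pass a list instead of a single string.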

Disclaimer: I wrote much of the naive Bayes in scikit-learn and the NLTK scikit-learn wrapper code.

Fred Foo
  • Umm why isn't this scikit NaiveBayesClassifier the default for NLTK if it is much faster?? I'm crying! – user3912889 Sep 01 '14 at 15:13
  • @user3912889 Because it's older, less readable if you're not familiar with linear algebra and SciPy, and it drags in a scikit-learn dependency. As I said, NLTK is a teaching tool. It was never meant to be used in production. – Fred Foo Sep 01 '14 at 15:37

If you really are pulling in 4 million features to analyze maybe a dozen words, most of the features won't be used. This suggests using some sort of disk-based database for the features instead, and pulling in only the ones you need. Even for a long sentence and an inefficient database, 4 seeks x 50 words is still way less than what you see now: maybe hundreds of milliseconds in the worst case, but certainly not multiple seconds.

Look at anydbm with an NDBM or GDBM back-end for a start, then maybe consider other back-ends depending on familiarity and availability.
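To sketch the idea (anydbm became the dbm package in Python 3; the scores and file path below are toy values for illustration):

```python
# Illustrative sketch: per-word sentiment scores in an on-disk dbm table,
# so scoring a sentence touches only the keys it needs.
import dbm
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), "sentiments.db")
with dbm.open(path, "c") as db:   # "c" creates the file if needed
    db["good"] = "1"
    db["bad"] = "-1"

# Lookup: one key probe per word, never the whole model.
with dbm.open(path, "r") as db:
    words = "I took a good look".lower().split()
    score = sum(int(db[w]) for w in words if w in db)  # score == 1
```

Each probe is a disk seek at worst and a cache hit at best, which is why the per-sentence cost stays in the millisecond range even for a large lexicon.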


Your follow-up comments suggest a basic misunderstanding of what you are doing and/or how things are supposed to work. Let's work through a simple example with five words in the lexicon.

# training
d = { 'good': 1, 'bad': -1, 'excellent': 1, 'poor': -1, 'great': 1 }
c = classifier(d)
with open("classifier.pickle", "wb") as f:
    pickle.dump(c, f)


sentences = ['I took a good look', 'Even his bad examples were stunning']

# classifying, stupid version
for sentence in sentences:
    with open("classifier.pickle", "rb") as f:
        c = pickle.load(f)
    sentiment = c(sentence)
    # basically,  for word in sentence.split(): if word in d: sentiment += d[word]
    print sentiment, sentence

# classifying, slightly less stupid version
with open("classifier.pickle", "rb") as f:
    c = pickle.load(f)
# FastCGI init_end here
for sentence in sentences:
    sentiment = c(sentence)
    print sentiment, sentence

The stupid version appears to be what you are currently experiencing. The slightly less stupid version loads the classifier once, and then runs it on each of the input sentences. This is what FastCGI will do for you: you can do the loading part in the process start-up once, and then have a service running which runs it on input sentences as they come in. This is resource-efficient but a bit of work, because converting your script to FastCGI and setting up the server infrastructure is a hassle. If you expect heavy use, it's definitely the way to go.

But observe that only two features out of the five in the model are actually ever needed. Most of the words in the sentences do not have a sentiment score, and most of the words in the sentiments database are not required to calculate a score for these inputs. So a database implementation would instead look something like (rough pseudocode for the DBM part)

with opendbm("sentiments.db") as d:
    for sentence in sentences:
        sentiment = 0
        for word in sentence.split():
            try:
                sentiment += d[word]
            except KeyError:
                pass
        print sentiment, sentence

The cost per transaction is higher, so it is less optimal than the FastCGI version, which only loads the whole model into memory at start-up; but it does not require you to keep state or set up the FastCGI infrastructure, and it is a lot more efficient than the stupid version which loads the entire model for each sentence.

(In reality, for a web service without FastCGI, you would effectively have the opendbm inside the for instead of the other way around.)

tripleee
  • http://stackoverflow.com/a/9713818/874188 suggests `dbhash` if you are on Wintendo (bless your soul). – tripleee Sep 01 '14 at 04:22
  • You're a genius tripleee! It makes sense. I don't know why most NLTK tutorials have this type of thinking. – user3912889 Sep 01 '14 at 04:52
  • Also I don't think I'm using anywhere near 4 million features.. I'm just using 600kb text file of positive sentences and another 500kb file of negative sentences. I get a 2 second delay. I haven't tried your DB idea but I most likely will implement that in the future. I just don't think that's the reason why I'm suffering with a 2 second delay right now.. Is python really this bad? – user3912889 Sep 01 '14 at 05:17
  • Also why can't I use all my features? Logically, wouldn't this ruin accuracy? So now I'm sacrificing accuracy for efficiency. And I'd still have to search through the feature database to see if there are word matches with the sentences I'm examining. So if the sentence has 6 words.. I'll have to search for 6 different words in each sentence in the database features. The smartest decision seems to be to change programming languages. – user3912889 Sep 01 '14 at 05:23
  • @tripleee Wintendo is unsupported by more than 10 years. Would you really run a web server with a 10 year old operating system which was pretty crappy in regard to security to begin with? – pqnet Sep 01 '14 at 07:26
  • *I* certainly wouldn't, no. – tripleee Sep 01 '14 at 08:15
  • user3912889: See updated answer. The smartest thing would be to explain your processing model if it differs from what I have speculatively assumed here. We can't help you if we don't understand what you are doing. – tripleee Sep 01 '14 at 08:50
  • In particular, your feature model is probably more complex than the naive one in the example -- potentially, a lot more so --, but you have not explained what you are using, so we have to guess. – tripleee Sep 01 '14 at 08:54
  • Hmm. I honestly don't see the point in all this. Why should I do all this work to speed up my python program when I can do the basic work in C++ and get better or equal speed?! This is like giving steroids to a midget horse... get a bigger horse. – user3912889 Sep 01 '14 at 15:11
  • If you make the same mistakes in C++ the code won't be much faster. – tripleee Sep 01 '14 at 16:33

I guess that the pickle save format just saves the training data, and the model gets re-calculated every time you load it.

You shouldn't reload the classifier every time you classify a sentence. Can you write the web service in such a way that it can process more than one request at a time?
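One way to get load-once behaviour without any FastCGI plumbing is to cache the unpickled classifier at module level. A minimal sketch in Python 3 (the dict model, file path and function names are stand-ins for illustration, not the actual NLTK classifier):

```python
# Illustrative sketch: unpickle the model at most once per process and
# reuse it across requests.
import functools
import os
import pickle
import tempfile

# Toy stand-in for the trained classifier: a dict of word scores.
MODEL_PATH = os.path.join(tempfile.mkdtemp(), "my_classifier.pickle")
with open(MODEL_PATH, "wb") as f:
    pickle.dump({"good": 1, "bad": -1}, f)

@functools.lru_cache(maxsize=1)
def get_classifier():
    # Runs once per process, however many requests arrive afterwards.
    with open(MODEL_PATH, "rb") as f:
        return pickle.load(f)

def score(sentence):
    model = get_classifier()
    return sum(model.get(word, 0) for word in sentence.lower().split())
```

As long as the web server keeps the worker process alive between requests, every call after the first skips the expensive deserialization entirely.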

I have never used ASP.NET or IIS. I looked around and it seems it is possible to configure IIS to use FastCGI by installing this extension (here the configuration instructions). How to write your Python script so that it is compatible with FastCGI is explained here.

pqnet
  • That is incredibly smart! The thing is, I need to access the file at different time periods. So one time I'll find the Sentiment of "This is the best!" and then say.. 3 minutes later I'll find the Sentiment of "This is the worst!" How would I be able to do that with C# asp.net and Python? – user3912889 Aug 31 '14 at 22:42
  • Depends how you call it from C#. The traditional fix is something like FastCGI but I don't know if that's available for asp.net. – tripleee Sep 01 '14 at 03:56
  • @user3912889 I added information about FastCGI. Would really help (you to get help) if you stopped your sarcastic comments every time you don't understand something. No need to rant about python being slow, if you do something as stupid as reloading your model from a serialized format every time you need to process a request. If I keep my computer shut down it is not strange that I can't reply to e-mail in seconds. – pqnet Sep 01 '14 at 07:30