
I'm trying to train an NLTK classifier for sentiment analysis and then save it with pickle. The freshly trained classifier works fine. However, after I load a saved classifier, it outputs either 'positive' or 'negative' for ALL examples.

I'm saving the classifier using

```python
classifier = nltk.NaiveBayesClassifier.train(training_set)
classifier.classify(words_in_tweet)
f = open('classifier.pickle', 'wb')
pickle.dump(classifier, f)
f.close()
```

and loading the classifier using

```python
f = open('classifier.pickle', 'rb')
classifier = pickle.load(f)
f.close()
classifier.classify(words_in_tweet)
```

I'm not getting any errors. Any idea what the problem could be, or how to debug this correctly?

Exzone
  • Can you show how you're training and using the classifier? The code above looks fine. – Eli Apr 19 '16 at 19:04
  • The classifier is trained using `classifier = nltk.NaiveBayesClassifier.train(training_set)` and used by `classifier.classify(tweet_features)`. As I said, if I'm freshly training a classifier and applying that to new data, it works just fine, just the loaded one is messed up. – Exzone Apr 19 '16 at 19:08
  • Still trying to understand your problem: can you add more description about what you're expecting in both cases? A sentiment analysis classifier outputting "positive" or "negative" is reasonable if that's what it was trained on. – Eli Apr 19 '16 at 19:14
  • Its output should actually be 'positive' or 'negative'. If I'm testing the freshly trained classifier on new data (I'm getting 1000 new Tweets for that) it outputs something like "positive: 600", " negative: 400". However, the loaded one will always output "positive: 1000", "negative: 0" or "positive: 0" ,"negative:1000". Sorry, if I have not made myself clear enough. – Exzone Apr 19 '16 at 19:29
  • This works fine for me. I can't really help without more information. Can you simplify the problem with just a couple of tweets and paste your actual data? – Eli Apr 19 '16 at 20:01
  • 1
    Can you _prove_ that the unpickled classifier is finding and using the same feature extraction function as the original classifier? That's where it usually goes wrong. – alexis Apr 19 '16 at 20:16
  • Can you show the code before the `nltk.NaiveBayesClassifier.train(training_set)`? Otherwise we can't help you much as it's unclear what's inside `training_set`. – alvas Apr 20 '16 at 01:27
  • Also, take a look at http://stackoverflow.com/questions/21107075/classification-using-movie-review-corpus-in-nltk-python – alvas Apr 20 '16 at 01:28
  • @alexis I am an idiot... I totally forgot to store/import the features... – Exzone Apr 20 '16 at 06:26
  • Well, I hope that solved your problem. If not, please take a look at my answer (which I also added for the possible benefit of future readers). – alexis Apr 20 '16 at 20:57

1 Answer


The most likely place a pickled classifier can go wrong is with the feature extraction function. This must be used to generate the feature vectors that the classifier works with.

The NaiveBayesClassifier expects feature vectors for both training and classification; your code looks as if you passed the raw words to the classifier instead (but presumably only after unpickling, otherwise you wouldn't get different behavior before and after unpickling). You should store the feature extraction code in a separate file, and import it in both the training and the classifying (or testing) script.
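As a sketch of that setup (the `extract_features` function and the toy training data below are hypothetical, not taken from the OP's code), the same extractor must run on the raw words both before training and before classifying with the unpickled model:

```python
import pickle
import nltk

# Hypothetical feature extractor. In a real project this would live in its
# own module (e.g. features.py) imported by BOTH the training script and
# the classifying script, so both are guaranteed to use identical code.
def extract_features(words):
    # Bag-of-words: NLTK classifiers expect a dict of feature -> value.
    return {word: True for word in words}

# Tiny toy training set (a stand-in for real tweet data).
training_set = [
    (extract_features(['great', 'love', 'it']), 'positive'),
    (extract_features(['awesome', 'great', 'day']), 'positive'),
    (extract_features(['terrible', 'hate', 'it']), 'negative'),
    (extract_features(['awful', 'terrible', 'day']), 'negative'),
]

classifier = nltk.NaiveBayesClassifier.train(training_set)

with open('classifier.pickle', 'wb') as f:
    pickle.dump(classifier, f)

with open('classifier.pickle', 'rb') as f:
    loaded = pickle.load(f)

# The crucial step: apply the SAME extract_features to new data
# before handing it to the unpickled classifier.
words_in_tweet = ['love', 'this', 'great', 'day']
label = loaded.classify(extract_features(words_in_tweet))
```

Passing `words_in_tweet` directly to `loaded.classify` instead of `extract_features(words_in_tweet)` is exactly the kind of mismatch that makes the loaded classifier collapse to a single label.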

I doubt this applies to the OP, but some NLTK classifiers take the feature extraction function as an argument to the constructor. When you have separate scripts for training and classifying, it can be tricky to ensure that the unpickled classifier finds the same function. This is because of the way pickle works: pickling only saves data, not code. To make it work, put the extraction function in a separate file (module) that both scripts import. If you put it in the "main" script, `pickle.load` will look for it in the wrong place.
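To see why, note that pickling a module-level function records only a reference to it (its module name plus its qualified name), never its code. Here `json.dumps` stands in for a user-defined extractor:

```python
import pickle
import json

# Pickle a plain module-level function: the payload contains only the
# module name and the function name, not the function's bytecode.
payload = pickle.dumps(json.dumps)

print(b'json' in payload, b'dumps' in payload)
```

If your extractor is defined in the script you run directly, it is recorded as `__main__.extract_features`; a different script unpickling it will look in its *own* `__main__`, which is the wrong place.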

alexis