
I have a Python script that successfully creates, trains and pickles a Naive Bayes classifier for string sentiment analysis. I've adapted code snippets found here and here, which have been great for an informed beginner like myself. However, both resources stop short of showing how to use a pickled classifier. Previous StackOverflow answers (here and here) suggest that both the classifier object itself AND the feature vector should be saved to disk and then loaded together for use later, but they don't include the syntax for how that ought to be achieved.

EDITS: this code works to train and store the classifier:

...

def get_words_in_descs(descs):
    all_words = []
    for (words, sentiment) in descs:
        all_words.extend(words)
    return all_words

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

def extract_features(document):
    document_words = set(document)
    features = {}
    for word in word_features:
        features['contains(%s)' % word] = (word in document_words)
    return features

training = [
           (['Multipurpose 4140 alloy steel'], 'metal'), 
           (['Easy-to-machine polyethylene tube'], 'plastic'),
           ...
           ] 

word_features = get_word_features(get_words_in_descs(training))
training_set = nltk.classify.apply_features(extract_features, training)

classifier = nltk.NaiveBayesClassifier.train(training_set)

outputFile = open('maxModel.pkl','wb')
pickle.dump(classifier, outputFile)
outputFile.close()

EDITS: Again, the code above works great. My issue is with a separate .py file, where I try to unpickle this classifier and then use it to classify a new, previously-unseen string. I originally thought the problem was that I had separated the classifier from word_features, but maybe something else is wrong?

Here is the code that is not working. I now get this error (is it expecting a list someplace?): `'dict_keys' object has no attribute 'copy'`

...

def get_word_features(wordlist):
    wordlist = nltk.FreqDist(wordlist)
    word_features = wordlist.keys()
    return word_features

with open('maxModelClassifier.pkl', 'rb') as fid:
    loaded_classifier = pickle.load(fid)
    #print(str(loaded_classifier.show_most_informative_features(100)))

#try to use the loaded_classifier:
print(loaded_classifier.classify(get_word_features(['super-cushioning', 'foam', 'sheet', 'adhesive-back', 'polyurethane'])))

Thanks for any insights.

2 Answers


Your code computes features for each tweet and saves them to a file. Didn't you forget something? You never trained the Naive Bayes classifier that your question mentions. (Or if you did, you didn't do it with the training data you show in your code.)

  1. Train a classifier by calling its train() method, passing it the list of labeled feature vectors you have computed.

    classifier = nltk.classify.NaiveBayesClassifier.train(training)
    

    Note that the training set should be a list of labeled dictionaries, not the list of labeled word lists you are creating. See chapter 6 of the NLTK book for an example of how to create a labeled feature vector in the right format.

  2. Use a classifier, freshly trained or unpickled, by calling one of the methods classify(), prob_classify(), classify_many() or prob_classify_many(). You'll need to compute features from the input you want to classify, and pass these features to the classification method (obviously without a label, since that's what you want to find out.)

    print(classifier.classify(get_word_features(["What", "is", "this"])))
    
  3. Pickle the trained classifier, not the features. The "syntax" is just pickle.dump(classifier, outputfile). A short end-to-end sketch of all three steps follows below.
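
Here is a minimal end-to-end sketch of the three steps above, assuming a tiny made-up training list and a feature function like the extract_features() from the question (the names and data are illustrative, not from the original post):

    import pickle
    import nltk

    # Hypothetical labeled word lists; replace with real training data.
    training = [(['multipurpose', 'alloy', 'steel'], 'metal'),
                (['easy', 'to', 'machine', 'polyethylene', 'tube'], 'plastic')]

    # Build the vocabulary once, then turn each word list into a feature dict.
    word_features = set(w for words, label in training for w in words)

    def extract_features(document):
        document_words = set(document)
        return {'contains(%s)' % w: (w in document_words) for w in word_features}

    # 1. Train on labeled feature dictionaries.
    training_set = [(extract_features(words), label) for words, label in training]
    classifier = nltk.NaiveBayesClassifier.train(training_set)

    # 2. Classify new input with the SAME feature function.
    print(classifier.classify(extract_features(['polyethylene', 'tube'])))

    # 3. Pickle the trained classifier (and the vocabulary it depends on).
    with open('classifier.pkl', 'wb') as out:
        pickle.dump((classifier, word_features), out)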

alexis
  • Sorry! My excerpt was poorly edited to show what I had actually accomplished. I have expanded the post to show what is working and what is not. The `print(classifier.classify(get_word_features(["What", "is", "this"])))` part still isn't working with the unpickled classifier. Thank you for your help. Does my new-and-improved code make any other issues obvious? – FirstOfThree Jun 29 '17 at 14:46
  • The error you get is because `get_word_features()` returns `wordlist.keys()` (a list-like object), instead of a dictionary as I pointed out in my answer. You train with `extract_features()`, you should be using the same function when classifying. Why do you have two feature functions? It makes no sense. – alexis Jun 29 '17 at 22:39
  • I was using `extract_features()` because it returns a dictionary (which is what `classify()` seemed to want). I kept `get_word_features()` because it appeared to work in the training code even though it returned a list. I'll try eliminating `get_word_features()` altogether for consistency. Thanks for your help. – FirstOfThree Jun 30 '17 at 14:16
  • It was "working in the training code" because you never used it for training; certainly not in the code you show. – alexis Jun 30 '17 at 15:02
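
Following the advice in these comments, here is a minimal sketch of what the corrected classification script might look like, assuming the vocabulary was pickled alongside the classifier as a (classifier, word_features) tuple (the filename and tuple layout are assumptions, not from the original post):

    import pickle

    # Load the classifier together with the vocabulary it was trained against.
    with open('maxModel.pkl', 'rb') as fid:
        loaded_classifier, word_features = pickle.load(fid)

    def extract_features(document):
        # The same feature function that was used at training time.
        document_words = set(document)
        return {'contains(%s)' % w: (w in document_words) for w in word_features}

    new_desc = ['super-cushioning', 'foam', 'sheet', 'adhesive-back', 'polyurethane']
    print(loaded_classifier.classify(extract_features(new_desc)))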

How about using json.dump() for the feature vector? A minimal sketch of that idea is below.
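
For example, assuming word_features is converted to a plain list first (dict_keys and set objects are not JSON-serializable):

    import json

    # In the training script: persist the vocabulary as JSON.
    with open('word_features.json', 'w') as f:
        json.dump(list(word_features), f)

    # In the classification script: load it back.
    with open('word_features.json') as f:
        word_features = set(json.load(f))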

hxysayhi