
I am trying this Naive Bayes classifier in Python:

classifier = nltk.NaiveBayesClassifier.train(train_set)
print "Naive Bayes Accuracy " + str(nltk.classify.accuracy(classifier, test_set)*100)
classifier.show_most_informative_features(5)

I get the following output:

(screenshot of the console output from `show_most_informative_features`)

It is clearly visible which words appear more in the "important" category and which in the "spam" category, but I can't work with these values. I actually want a list that looks like this:

[[pass,important],[respective,spam],[investment,spam],[internet,spam],[understands,spam]]

I am new to Python and having a hard time figuring all this out. Can anyone help? I will be very thankful.

3 Answers


You could slightly modify the source code of `show_most_informative_features` to suit your purpose.

The first element of each sub-list is the most informative feature name, while the second element is its label (more specifically, the label associated with the numerator term of the ratio).

helper function:

def show_most_informative_features_in_list(classifier, n=10):
    """
    Return a nested list of the "most informative" features
    used by the classifier along with their predominant labels.
    """
    cpdist = classifier._feature_probdist       # probability distribution for feature values given labels
    feature_list = []
    for (fname, fval) in classifier.most_informative_features(n):
        def labelprob(l):
            return cpdist[l, fname].prob(fval)
        # sort the labels by P(fval | label); the last (highest) one is the predominant label
        labels = sorted([l for l in classifier._labels if fval in cpdist[l, fname].samples()],
                        key=labelprob)
        feature_list.append([fname, labels[-1]])
    return feature_list
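For reference, a classifier like the one used below can be trained on NLTK's movie review corpus. This is only a minimal sketch; the 2000-word feature cutoff and the 1800-document training slice are arbitrary illustrative choices, not part of the original answer:

import nltk
from nltk.corpus import movie_reviews

# (document word set, label) pairs; the labels are 'pos' and 'neg'
documents = [(set(movie_reviews.words(fileid)), category)
             for category in movie_reviews.categories()
             for fileid in movie_reviews.fileids(category)]

# binary "word is present" features over the 2000 most frequent words
all_words = nltk.FreqDist(w.lower() for w in movie_reviews.words())
word_features = [w for w, _ in all_words.most_common(2000)]

featuresets = [({w: (w in doc_words) for w in word_features}, label)
               for doc_words, label in documents]

classifier = nltk.NaiveBayesClassifier.train(featuresets[:1800])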

Testing this on a classifier trained over the positive/negative movie review corpus of NLTK:

show_most_informative_features_in_list(classifier, 10)

produces:

[['outstanding', 'pos'],
 ['ludicrous', 'neg'],
 ['avoids', 'pos'],
 ['astounding', 'pos'],
 ['idiotic', 'neg'],
 ['atrocious', 'neg'],
 ['offbeat', 'pos'],
 ['fascination', 'pos'],
 ['symbol', 'pos'],
 ['animators', 'pos']]
– Nickil Maveli
  • Actually, there's already a `most_informative_features()` function in https://github.com/nltk/nltk/blob/develop/nltk/classify/naivebayes.py#L123 I don't think there's a need to reimplement it =) – alvas Mar 23 '17 at 09:59
  • I agree. But that only shows tabular string output, which cannot be stored as it is. The OP wants the feature names and their associated labels output in list form. – Nickil Maveli Mar 23 '17 at 10:01
  • IIUC, those are just the `fname` and `fvals`. He's after `fname` and its associated `label` (the pos/neg distinction), or in his case the spam/ham classification. – Nickil Maveli Mar 23 '17 at 10:05
  • Yes, e.g. the labels from the movie review example are the booleans True and False. But if the label is a string, it'll return a string. Let me try and verify this; maybe `nltk` would break =) – alvas Mar 23 '17 at 10:06
  • Sorry, I don't think changing the boolean values to pos/neg is correct. But anyway, I have limited knowledge of this subject, so I can't argue against it. When you scroll down, `nice` is tagged with `False`. So does that become negative, according to you? – Nickil Maveli Mar 23 '17 at 10:19
  • To check: try and compare your results with the ratio obtained by `classifier.show_most_informative_features(10)`. If they concur, then what you've shown is correct. – Nickil Maveli Mar 23 '17 at 10:29
  • @NickilMaveli thanks a lot. I wanted the classified tag with each word, and your solution was on point. :) – Sebastian Gomes Mar 23 '17 at 10:49

Simply use `most_informative_features()`.

Using the examples from Classification using movie review corpus in NLTK/Python:

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')

# (token list, label) pairs; the label is the first part of the file id, e.g. 'pos' or 'neg'
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

# binary "word is present in document" features over 100 words of the vocabulary
word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]

# 90/10 train/test split
numtrain = int(len(documents) * 90 / 100)
train_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:(i in tokens) for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)

Then, simply:

print(classifier.most_informative_features())

[out]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True)]

And to list all features:

classifier.most_informative_features(n=len(word_features))

[out]:

[('turturro', True),
 ('inhabiting', True),
 ('taboo', True),
 ('conflicted', True),
 ('overacts', True),
 ('rescued', True),
 ('stepdaughter', True),
 ('apologizing', True),
 ('pup', True),
 ('inform', True),
 ('commercially', True),
 ('utilize', True),
 ('gratuitous', True),
 ('visible', True),
 ('internet', True),
 ('disillusioned', True),
 ('boost', True),
 ('preventing', True),
 ('built', True),
 ('repairs', True),
 ('overplaying', True),
 ('election', True),
 ('caterer', True),
 ('decks', True),
 ('retiring', True),
 ('pivot', True),
 ('outwitting', True),
 ('solace', True),
 ('benches', True),
 ('terrorizes', True),
 ('billboard', True),
 ('catalogue', True),
 ('clean', True),
 ('skits', True),
 ('nice', True),
 ('feature', True),
 ('must', True),
 ('withdrawn', True),
 ('indulgence', True),
 ('tribal', True),
 ('freeman', True),
 ('must', False),
 ('nice', False),
 ('feature', False),
 ('gratuitous', False),
 ('turturro', False),
 ('built', False),
 ('internet', False),
 ('rescued', False),
 ('clean', False),
 ('overacts', False),
 ('gregor', False),
 ('conflicted', False),
 ('taboo', False),
 ('inhabiting', False),
 ('utilize', False),
 ('churns', False),
 ('boost', False),
 ('stepdaughter', False),
 ('complementary', False),
 ('gleiberman', False),
 ('skylar', False),
 ('kirkpatrick', False),
 ('hardship', False),
 ('election', False),
 ('inform', False),
 ('disillusioned', False),
 ('visible', False),
 ('commercially', False),
 ('frosted', False),
 ('pup', False),
 ('apologizing', False),
 ('freeman', False),
 ('preventing', False),
 ('nutsy', False),
 ('intrinsics', False),
 ('somalia', False),
 ('coordinators', False),
 ('strengthening', False),
 ('impatience', False),
 ('subtely', False),
 ('426', False),
 ('schreber', False),
 ('brimley', False),
 ('motherload', False),
 ('creepily', False),
 ('perturbed', False),
 ('accountants', False),
 ('beringer', False),
 ('scrubs', False),
 ('1830s', False),
 ('analogue', False),
 ('espouses', False),
 ('xv', False),
 ('skits', False),
 ('solace', False),
 ('reduncancy', False),
 ('parenthood', False),
 ('insulators', False),
 ('mccoll', False)]

To clarify:

>>> type(classifier.most_informative_features(n=len(word_features)))
list
>>> type(classifier.most_informative_features(10)[0][1])
bool

For further clarification: if the feature values used in the feature set are strings, `most_informative_features()` will return strings, e.g.

import string
from itertools import chain

from nltk.corpus import movie_reviews as mr
from nltk.corpus import stopwords
from nltk.probability import FreqDist
from nltk.classify import NaiveBayesClassifier as nbc
import nltk

stop = stopwords.words('english')
documents = [([w for w in mr.words(i) if w.lower() not in stop and w.lower() not in string.punctuation], i.split('/')[0]) for i in mr.fileids()]

word_features = FreqDist(chain(*[i for i,j in documents]))
word_features = list(word_features.keys())[:100]

numtrain = int(len(documents) * 90 / 100)
train_set = [({i:'positive' if (i in tokens) else 'negative' for i in word_features}, tag) for tokens,tag in documents[:numtrain]]
test_set = [({i:'positive' if (i in tokens) else 'negative'  for i in word_features}, tag) for tokens,tag in documents[numtrain:]]

classifier = nbc.train(train_set)

And:

>>> classifier.most_informative_features(10)
[('turturro', 'positive'),
 ('inhabiting', 'positive'),
 ('conflicted', 'positive'),
 ('taboo', 'positive'),
 ('overacts', 'positive'),
 ('rescued', 'positive'),
 ('stepdaughter', 'positive'),
 ('pup', 'positive'),
 ('apologizing', 'positive'),
 ('inform', 'positive')]

>>> type(classifier.most_informative_features(10)[0][1])
str
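If you specifically need nested lists rather than a list of tuples, to match the format shown in the question, a simple conversion would be:

feature_label_pairs = [list(pair) for pair in classifier.most_informative_features(10)]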
– alvas

The most informative features (the most distinguishing or differentiating tokens) for Naive Bayes are going to be those with the largest difference in p(word | class) between the two classes.

You'll have to do some text manipulation and tokenization first so that you end up with two lists: one list of all tokens present in all the strings tagged as class A, and another list of all tokens present in all the strings tagged as class B. These two lists should contain repeated tokens that we can count to create frequency distributions; a minimal sketch of this step is shown below.
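For example, a minimal sketch of that preprocessing step, assuming your data is a list of (text, label) pairs with the labels "important" and "spam" (the `tagged_texts` variable and its two example strings are made up for illustration):

import nltk

# hypothetical input: (text, label) pairs -- replace with your own data
tagged_texts = [("please pass the report to the respective team", "important"),
                ("amazing investment opportunity on the internet", "spam")]

classAWords = []   # all tokens from texts tagged as class A ("important")
classBWords = []   # all tokens from texts tagged as class B ("spam")
for text, label in tagged_texts:
    tokens = nltk.word_tokenize(text.lower())   # requires NLTK's 'punkt' tokenizer data
    if label == "important":
        classAWords.extend(tokens)
    else:
        classBWords.extend(tokens)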

Run this code:

classA_freq_distribution = nltk.FreqDist(classAWords)
classB_freq_distribution = nltk.FreqDist(classBWords)
# take the 3000 most frequent tokens from each class
classA_word_features = [word for word, count in classA_freq_distribution.most_common(3000)]
classB_word_features = [word for word, count in classB_freq_distribution.most_common(3000)]

This will grab the 3000 most frequent tokens from each list, but you could pick another number besides 3000. Now that you have the frequency distributions, you can compute p(word | class) and then look at the differences between the two classes.

import pandas as pd

features = []
diff = []
for feature in classA_word_features:
    features.append(feature)
    # difference in relative frequency, an estimate of p(word | B) - p(word | A)
    diff.append(classB_freq_distribution[feature] / len(classBWords)
                - classA_freq_distribution[feature] / len(classAWords))

all_features = pd.DataFrame({
    'Feature': features,
    'Diff': diff
})

Then you can sort and look at the highest- and lowest-valued words. With this definition of Diff, large positive values indicate words characteristic of class B and large negative values words characteristic of class A.

sorted_features = all_features.sort_values(by=['Diff'], ascending=False)
print(sorted_features)
– Rob