
As far as I can understand the examples using the NLTK classifier:

They seem to work only off features of the sentence itself. So you'd have...

corpus = [
    ("This is a sentence"),
    ("This is another sentence")
]

...and you apply some function, like count_words_ending_in_a_vowel(), to the sentence itself.
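
For example, such a sentence-only feature function might look something like this (count_words_ending_in_a_vowel is just a hypothetical name used for illustration):

def count_words_ending_in_a_vowel(sentence):
    # Count the words whose last character is a vowel.
    return sum(1 for word in sentence.split() if word[-1].lower() in "aeiou")

count_words_ending_in_a_vowel("This is a sentence")  # -> 2 ("a" and "sentence")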

Instead, I'd like to attach a piece of outside data to each sentence: not something derived from the text itself, but an external label, like:

corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome")
]

Or

corpus = [
    {"text": "This is a sentence", "label": "awesome"},
    {"text": "This is another sentence", "label": "not awesome"}
]

(In the event that I might have multiple outside labels.)

My question is: given that my dataset has these external labels in it, how do I reformat the corpus into the format that NaiveBayesClassifier.train() expects? I understand I also need to apply a tokenizer to the "text" field above, but what is the overall format that I should be passing to the NaiveBayesClassifier.train() function?

So that I can then apply:

classifier = nltk.NaiveBayesClassifier.train(goods)
classifier.show_most_informative_features(32)  # prints its output directly; wrapping it in print() just adds a trailing None

My broader objective: I'd like to look at how differential word frequencies predict the label, i.e., which sets of words are most informative in separating the labels from each other. This has a bit of a k-means feel, but I'm told I should be able to do it entirely within NLTK and am just having trouble getting my data into the appropriate input format.


1 Answer


I have had success with the following approach:

import nltk

train = [({'some': True, 'tokens': True}, 'label'),
         ({'other': True, 'word': True}, 'different label'),
         ({'cool': True, 'document': True}, 'label')]
classifier = nltk.NaiveBayesClassifier.train(train)

So train is a list of documents (each a tuple). The first element of each tuple is a dictionary of tokens (the token is the key and the value is True to indicate the presence of that token) and the second element is a label associated with the document.
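
For instance, here is a minimal sketch of how the (text, label) corpus from the question could be converted into that format. The bag-of-words presence features and nltk.word_tokenize are just one reasonable choice, not the only one:

import nltk

# The corpus of (text, label) pairs from the question.
corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome")
]

def features(text):
    # Bag-of-words presence features: each token maps to True.
    return {token: True for token in nltk.word_tokenize(text.lower())}

# Convert to the (featureset, label) pairs that train() expects.
train = [(features(text), label) for (text, label) in corpus]

classifier = nltk.NaiveBayesClassifier.train(train)
classifier.show_most_informative_features(32)

Note that nltk.word_tokenize needs the punkt tokenizer data (nltk.download('punkt')); any tokenizer that returns a list of tokens would work here.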

  • Hmm, my data is in the format you're describing, and my classifiers kept returning `>>> print classifier.show_most_informative_features(4) Most Informative Features None `. I assumed that meant I had a syntax error. But it seems it means my data/model are problematic? – Mittenchops Dec 19 '13 at 01:32