As far as I can tell from these examples of using the NLTK classifier:
- http://nbviewer.ipython.org/github/carljv/Will_it_Python/blob/master/MLFH/CH3/ch3_nltk.ipynb
- http://www.nltk.org/book/ch06.html
- NLTK classify interface using trained classifier
- Implementing Bag-of-Words Naive-Bayes classifier in NLTK
- http://my.safaribooksonline.com/book/databases/9781783280995/11dot-sentiment-analysis-of-twitter-data/id286781656#X2ludGVybmFsX0h0bWxWaWV3P3htbGlkPTk3ODE3ODMyODA5OTUlMkZpZDI4Njc4MjEwNCZxdWVyeT0=
They seem to use only features derived from the sentence itself. So, you'd have...

corpus = [
    "This is a sentence",
    "This is another sentence",
]

...and you apply some function, like count_words_ending_in_a_vowel(), to the sentence itself.
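For example (a toy sketch of that kind of text-only feature function; count_words_ending_in_a_vowel is just the made-up name from above, not anything in NLTK):

def count_words_ending_in_a_vowel(sentence):
    # Count tokens whose final character is a vowel.
    return sum(1 for word in sentence.split() if word[-1].lower() in "aeiou")

# One featureset per sentence, derived purely from the text.
features = [{"vowel_endings": count_words_ending_in_a_vowel(s)} for s in corpus]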
Instead, I'd like to attach a piece of outside data to each sentence: not something derived from the text itself, but an external label, like:
corpus = [
    ("This is a sentence", "awesome"),
    ("This is another sentence", "not awesome"),
]
Or:

corpus = [
    {"text": "This is a sentence", "label": "awesome"},
    {"text": "This is another sentence", "label": "not awesome"},
]
(In case I have multiple outside labels per sentence.)
My question is: given that my dataset has these external labels in it, how do I reformat the corpus into the format that NaiveBayesClassifier.train() expects? I understand I also need to apply a tokenizer to the "text" field above, but what is the full structure I should be passing to the NaiveBayesClassifier.train() function?
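Here is my current attempt at the conversion (just a sketch: I'm guessing that train() wants (featureset, label) pairs, and word_feats is a helper I made up, so please correct me if the shape is wrong):

import nltk
from nltk.tokenize import word_tokenize

def word_feats(text):
    # Bag-of-words featureset: map each lowercased token to True.
    return {word: True for word in word_tokenize(text.lower())}

# Turn each {"text": ..., "label": ...} record into a (featureset, label) pair.
goods = [(word_feats(doc["text"]), doc["label"]) for doc in corpus]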
So that I can then run:

classifier = nltk.NaiveBayesClassifier.train(goods)
classifier.show_most_informative_features(32)
My broader objective: I'd like to see how differential word frequencies predict the label, i.e. which sets of words are most informative in separating the labels from each other. This has a bit of a k-means feel, but I'm told I should be able to do it entirely within NLTK; I'm just having trouble getting my data into the input format it expects.
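In the meantime, the closest I've gotten to that frequency comparison is tallying words per label by hand (a sketch using nltk.ConditionalFreqDist on the dict-style corpus above):

import nltk
from nltk.tokenize import word_tokenize

# Tally (label, word) pairs so each label gets its own frequency distribution.
cfd = nltk.ConditionalFreqDist(
    (doc["label"], word)
    for doc in corpus
    for word in word_tokenize(doc["text"].lower())
)

# Most common words under each label.
for label in cfd.conditions():
    print(label, cfd[label].most_common(10))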