I'm working on a class project ("homework", if you will). It takes in anime names and genres, each labeled relevant or irrelevant. I'm trying to train an NLTK NaiveBayesClassifier on that data so that, when I pass in a set of genres, it tells me whether the anime is relevant or irrelevant. I currently have the following:

import nltk
trainingdata = [
    ({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'),
    ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(trainingdata)
classifier.classify({'Fantasy': True, 'Comedy': True, 'Supernatural': True})
prob_dist = classifier.prob_classify(anime)
print "relevant " + str(prob_dist.prob("relevant"))
print "unrelevant " + str(prob_dist.prob("unrelevant"))

I currently have:

size of training array: 110
relevant length: 57
unrelevant length: 53

Some results I receive:

relevant Tantei Opera Milky Holmes TD
input data passed to classify: {'Mystery': True, 'Comedy': True, 'Super': True, 'Power': True}
relevant 0.518018018018
unrelevant 0.481981981982

relevant Juuou Mujin no Fafnir
input data passed to classify: {'Romance': True, 'Fantasy': True, 'School': True}
relevant 0.518018018018
unrelevant 0.481981981982

So it looks like it's not reading my data correctly: the reported probability is essentially just the fraction of relevant examples (57/110 ≈ 0.518), regardless of the input. But I'm not sure what I am doing wrong.
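
If I'm reading NLTK's defaults right, the exact value is not 57/110 (that's 0.51818...) but the smoothed class prior: the default estimator (expected likelihood estimation) adds 0.5 to each label count, giving (57 + 0.5)/(110 + 1). A quick arithmetic check, using only my counts above:

```python
# Sanity check: the probability printed for every input matches NLTK's
# smoothed class prior (expected likelihood: add 0.5 to each label count).
relevant, unrelevant = 57, 53
total = relevant + unrelevant        # 110 training examples
labels = 2                           # 'relevant' and 'unrelevant'

prior = (relevant + 0.5) / (total + 0.5 * labels)
print(round(prior, 12))              # 0.518018018018
```

So the classifier is ignoring the features entirely and falling back on the label frequencies.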

I looked at this nltk NaiveBayesClassifier training for sentiment analysis

and I feel like I am doing it correctly. The only thing I am not doing is explicitly listing, for each example, every feature key that doesn't occur in it (i.e. marking absent genres as False). Does that matter?

Thanks!

  • possible duplicate of [Same probabilties for all naive bayes classifications](http://stackoverflow.com/questions/27306150/same-probabilties-for-all-naive-bayes-classifications) – alvas Dec 06 '14 at 18:58
  • First, asking the same question multiple times is not cool on StackOverflow. Next, homework questions are mostly ignored. Lastly, if you would like someone to answer the question, try asking bite-size questions. – alvas Dec 06 '14 at 18:59
  • You've posted partial code and there's no clue to what `anime` in your code is. – alvas Dec 06 '14 at 19:09
  • What does your input data look like? What is the NLP problem you're trying to solve? Why are you passing features rather than training data as your input? What is your expected output? – alvas Dec 06 '14 at 19:13
  • My input data for classification is what's labeled "input data passed to classify". The training data is feature sets with their relevancy labels, as I understand it. The expected output is different relevant/unrelevant probabilities; from my understanding of the algorithm, it should display different probabilities for those cases. – user3537288 Dec 06 '14 at 19:28
  • you're training on two instances. Could you send a link to the full dataset? – alvas Dec 06 '14 at 19:38
  • The code above is a sample of how I am currently doing it; I didn't want to post everything because it would be too much. I currently have a JSON object that I convert to the format above. The link to the JSON object is https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2 – user3537288 Dec 06 '14 at 19:54
  • the json file doesn't contain the relevant/unrelevant annotation. – alvas Dec 06 '14 at 20:00
  • Yes, that is why I said the link is just the JSON object; I convert it to something that looks like "[({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), ({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]" during execution of the program. The closest thing to that conversion is the top section of this file, which has relevant/unrelevant labels, but this specific file isn't the scenario shown above: https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/test – user3537288 Dec 06 '14 at 20:03
  • Why don't you extract the 100 documents' feature vectors, put them into a pastebin or somewhere, and change your code so that it's more similar to your project, and we can try to help you. If not, the answer I gave is the best anyone can do given the context you've provided. – alvas Dec 06 '14 at 20:26

1 Answer


Some background: the OP's purpose is to build a classifier for this project: https://github.com/alejandrovega44/CSCE-470-Anime-Recommender

First, there are a few terminological issues in what you're calling things.

Your training data should be the raw data you're using for your task, i.e. the JSON file at: https://raw.githubusercontent.com/alejandrovega44/CSCE-470-Anime-Recommender/naive2/py/UserAnime2

And the data structures you have in your question should be called (labeled) feature vectors, i.e.:

({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant')
({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')
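
As a sketch of how such a pair can be built from a genre list (the helper name `to_feature_vector` is my own, not from the OP's code), lowercasing the keys so that training and test features share one vocabulary:

```python
def to_feature_vector(genres, label):
    # Map a list of genre strings to an NLTK-style (features, label) pair.
    # Lowercasing each key keeps training and test vocabularies consistent.
    return ({genre.lower(): True for genre in genres}, label)

fv = to_feature_vector(['Drama', 'Mystery', 'Horror', 'Psychological'], 'relevant')
# fv == ({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant')
```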

The features in the training set in your sample code are:

'drama'
'mystery'
'horror'
'psychological'
'fantasy'
'romance'
'adventure'
'science fiction'

But the features in your test set in your sample code are:

'Fantasy'
'Comedy'
'Supernatural'
'Mystery'
'Comedy'
'Super'
'Power'
'Romance'
'Fantasy'
'School'

Because string comparison is case-sensitive, none of the features in your test data occur in your training data. Hence the classifier falls back on the class prior, which for a binary class with one example per label is 50%-50%, i.e.:

import nltk
feature_vectors =[
({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), 
({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(feature_vectors)
prob_dist = classifier.prob_classify({'Fantasy': True, 'Comedy': True, 'Supernatural': True})
print "relevant " + str(prob_dist.prob("relevant"))
print "unrelevant " + str(prob_dist.prob("unrelevant"))

[out]:

relevant 0.5
unrelevant 0.5

Even if you give it the same documents but with capitalized features, the classifier won't recognize them, e.g.:

import nltk
feature_vectors =[
({'drama': True, 'mystery': True, 'horror': True, 'psychological': True}, 'relevant'), 
({'drama': True, 'fantasy': True, 'romance': True, 'adventure': True, 'science fiction': True}, 'unrelevant')]
classifier = nltk.classify.naivebayes.NaiveBayesClassifier.train(feature_vectors)

doc1 = {'drama': True, 'mystery': True, 'horror': True, 'psychological': True}
prob_dist = classifier.prob_classify(doc1)
print "relevant " + str(prob_dist.prob("relevant"))
print "unrelevant " + str(prob_dist.prob("unrelevant"))
print '----'
caps_doc1 = {'Drama': True, 'Mystery': True, 'Horror': True, 'Psychological':True }
prob_dist = classifier.prob_classify(caps_doc1)
print "relevant " + str(prob_dist.prob("relevant"))
print "unrelevant " + str(prob_dist.prob("unrelevant"))
print '----'

[out]:

relevant 0.964285714286
unrelevant 0.0357142857143
----
relevant 0.5
unrelevant 0.5
----
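
One way to avoid this mismatch is to normalize the feature keys the same way before both training and classifying. A minimal sketch (`normalize` is my own helper, not part of NLTK):

```python
def normalize(features):
    # Lowercase every genre key so test features match the training vocabulary
    return {genre.lower(): flag for genre, flag in features.items()}

train_doc = {'drama': True, 'mystery': True, 'horror': True, 'psychological': True}
caps_doc = {'Drama': True, 'Mystery': True, 'Horror': True, 'Psychological': True}

# After normalization the capitalized document is identical to the training one,
# so the classifier would score both the same way
assert normalize(caps_doc) == train_doc
```

Applying the same normalization when building the feature vectors for training and before every `prob_classify` call makes the two vocabularies line up.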

Without more details and a more complete code sample to debug, this is all the help we can give on this question. =(

alvas