
So I'm trying my hand at sentiment analysis. I've read in many places that Naive Bayes is good enough. I manually gathered some negative comments (~400), and after cleaning up the comments file I ended up with these most frequent words for negative comments:

negative_comments.most_common(40) #Similarly for positive..

[('never', 79),
 ('i', 63),
 ('restaurant', 51),
 ('it', 48),
 ('one', 47),
 ('get', 47),
 ('time', 43),
 ('would', 41),
 ('bad', 39),
 ('service', 38),
 ('don', 36),
 ('us', 36),
 ('work', 35),
 ('family', 35),
 ('day', 35),
 ('please', 32),
 ('stove', 32),
 ('you', 31),
 ('like', 31),
 ('got', 28),
 ('back', 27),
 ('customer', 27),
 ('years', 25),
 ('good', 25),
 ('people', 24),
 ('open', 24),
 ('online', 24),
 ('days', 23),
 ('right', 23),
 ('flea-market', 23),
 ('we', 21),
 ('way', 20)]

As you can see, there's hardly any negative word among the most frequent ones. If I use these most frequent words to generate my features for Naive Bayes, I don't see how the classifier can perform well. I might as well simply search for words like:

"dislike", "bad", "awful", "hate"...

and expect a better result than using Naive Bayes on the most frequent negative words. Is there a better approach than these methods?
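The keyword-lookup baseline described above can be sketched in a few lines. This is a minimal illustration, not a recommendation; the word list and example comments are made up:

```python
# Toy keyword-lookup baseline: flag a comment as negative if it contains
# any word from a hand-picked negative list (the list here is illustrative).
NEGATIVE_WORDS = {"dislike", "bad", "awful", "hate", "terrible", "worst"}

def is_negative(comment):
    """Return True if any token (punctuation stripped) is a known negative word."""
    tokens = comment.lower().split()
    return any(tok.strip(".,!?") in NEGATIVE_WORDS for tok in tokens)

print(is_negative("The service was awful and slow."))  # True
print(is_negative("Great pizza, friendly staff."))     # False
```

As the comments below point out, a baseline like this fails on negation ("not bad") and misses any negative word outside the hand-picked list, which is part of why a trained classifier is worth the effort.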

  • You should ask about how Naive Bayes works here: http://stats.stackexchange.com/ – Reut Sharabani Dec 12 '14 at 12:41
  • I don't think the OP is confused about how Naive Bayes works. It's just that Naive Bayes seems not to be working in his case, hence he's asking for improvements or suggestions for another algorithm, maybe. If I'm sensing it right! –  Dec 12 '14 at 12:45
  • @ReutSharabani I understand NBC; I have used the movie_reviews example on nltk etc., but I'm confused about my case and am asking for improvements. –  Dec 12 '14 at 12:48
  • If the sentence or paragraph is negative, that doesn't mean all of its words have negative polarity. – badc0re Dec 12 '14 at 13:00
  • It's also not how you'd select words to classify by. The words you've selected are the **common** words (like 'i', which means very little). What you should be selecting are the most **informative** words: words that help you split the data — or, in your case, words common in either the "good" or the "bad" group. NLTK should have an explanation of that in their tutorials. For good results you may want to dive into the most informative n-grams, for combinations of words (like "bad" versus "not bad"). Good luck. – Reut Sharabani Dec 12 '14 at 13:02
  • @rzach how did you train your naive bayes classifier? what classes did you have? what data did you use to train it? – user823743 Dec 12 '14 at 13:49
  • You should maybe also try bigram and trigram features. Though I can't recall much improvement from using them, give them a try. –  Dec 12 '14 at 14:23
  • @user823743 I trained the NBC using the above features, i.e. {has "never": True, has "restaurant": False, ...} –  Dec 12 '14 at 17:09
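The "informative words" idea from the comments above can be sketched without NLTK: instead of ranking words by raw frequency, rank them by how much more often they occur in one class than the other. The toy corpora below are stand-ins for the real positive/negative comment files:

```python
from collections import Counter

# Toy stand-ins for the tokenized negative and positive comment files.
neg_tokens = "never bad awful service bad never hate food".split()
pos_tokens = "great food never better service friendly good".split()

neg_counts, pos_counts = Counter(neg_tokens), Counter(pos_tokens)

def informativeness(word, smoothing=1.0):
    """Smoothed ratio of negative-class to positive-class frequency.

    High values mean the word leans negative; values near 1 mean the word
    appears about equally in both classes and so discriminates poorly.
    """
    return (neg_counts[word] + smoothing) / (pos_counts[word] + smoothing)

vocab = set(neg_counts) | set(pos_counts)
ranked = sorted(vocab, key=informativeness, reverse=True)
# 'bad' (2 vs 0 occurrences) now outranks 'never' (2 vs 1),
# unlike in a raw most_common() list.
```

This is essentially what NLTK's `NaiveBayesClassifier.show_most_informative_features()` surfaces after training: features weighted by how strongly they separate the classes, not by how often they occur overall.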

1 Answer


This is not the way to approach your problem. You have assumed that in a corpus of 400 negative comments you will find mostly negative words, right? That assumption is in most cases incorrect. The most common words you will find are stopwords, such as 'I', 'it', 'you', 'we', etc., plus some words that reflect the general topic of your corpus.

However, if you would like to follow your approach, you should first remove the top N common words (N depends on the dataset). Then the most frequent remaining words might lead you to polar words. I say *might* because, although this approach can work (with a probability that depends on the data), it is very noisy.

Now, if you want to do sentiment analysis, why don't you use a sentiment lexicon for training an NB classifier? You can read my answer about sentiment lexicons here. There are many ways to solve your problem, but because I don't know anything about your dataset, I can't judge. Let me know if you have further questions.
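The first suggestion above — strip stopwords before counting — can be sketched as follows. The stopword set here is a tiny stand-in for a real list such as `nltk.corpus.stopwords.words('english')`, and the comments are made up:

```python
from collections import Counter

# Tiny stand-in stopword list; in practice use NLTK's English stopwords
# plus the corpus's own top-N generic words.
STOPWORDS = {"i", "it", "you", "we", "us", "the", "a", "was", "would", "don"}

comments = [
    "i would never go back",
    "the service was bad",
    "bad food never again",
]

tokens = [tok for c in comments for tok in c.split() if tok not in STOPWORDS]
filtered = Counter(tokens)
# 'never' and 'bad' now top the counts instead of pronouns like 'i' or 'we'.
```

With stopwords removed, the frequency list starts to surface topic and polarity words, which is exactly what the question's unfiltered `most_common(40)` output was burying under 'i', 'it', 'you', and 'we'.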

user823743
  • Hi, I guess you are talking about AFINN, but then how would it deal with a comment like "I want to eat this pizza so badly"? I guess polarity should come at a second level; right now I'm stuck at the first level (i.e. pos/neg), then comes polarity. My data is restaurant/food review data, from their Facebook pages, Twitter, etc. –  Dec 13 '14 at 07:19
  • @rzach Hi, I didn't say you should use AFINN; it's a very basic sentiment lexicon, not suitable for achieving high accuracies. A lexicon can also be equipped with a phrase list, including the phrase you mentioned! Also, we always assess the accuracy of a classifier statistically: one instance of misclassification is not enough to conclude that the classifier is totally flawed. What do you mean by "first level (i.e. pos/neg)"? Finally, I also told you the correct way of proceeding with your own approach. You can try it out. – user823743 Dec 13 '14 at 23:05