I'm working on a C++ Twitter company sentiment analysis tool. The user inputs a company name, and the tool analyzes a number of tweets and returns a sentiment.
So far I have done the following:
- limit tweets to English and recent
- make lowercase
- remove RT, the # symbol, @usernames and URLs
- remove characters like &^%$(){} etc.
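
A minimal sketch of the cleaning steps above, not my exact code. The helper name `cleanTweet` is made up for illustration, and it assumes tokens are separated by whitespace and that URLs start with `http://` or `https://`. The language and recency filtering is assumed to happen earlier, when the tweets are fetched.

```cpp
#include <cctype>
#include <sstream>
#include <string>
#include <vector>

// Hypothetical helper: lowercases a tweet, drops "rt", @usernames and URLs,
// then strips every non-alphanumeric character (including '#') from the
// remaining tokens. Returns the surviving words in order.
std::vector<std::string> cleanTweet(const std::string& tweet) {
    std::vector<std::string> words;
    std::istringstream in(tweet);
    std::string token;
    while (in >> token) {  // operator>> never yields an empty token
        for (char& c : token)
            c = static_cast<char>(std::tolower(static_cast<unsigned char>(c)));

        if (token == "rt" || token[0] == '@') continue;        // retweet marker, username
        if (token.rfind("http://", 0) == 0 ||
            token.rfind("https://", 0) == 0) continue;          // URL

        std::string word;
        for (char c : token)
            if (std::isalnum(static_cast<unsigned char>(c))) word += c;
        if (!word.empty()) words.push_back(word);
    }
    return words;
}
```

One side effect of this version: apostrophes are stripped too, so "don't" becomes "dont". The dictionaries would need to use the same spelling for such words to match.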
I then split each tweet into words and check each word against two dictionaries, one of positive words and one of negative words. This gives a total sentiment for each tweet. I then count the number of positive, neutral and negative tweets to come up with a final answer. No weights are used.
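
Roughly, the scoring looks like the sketch below. It assumes the per-tweet total is the number of positive hits minus the number of negative hits, and that a total of zero means neutral. The names `scoreTweet`, `classify`, `Tally` and `tallyTweets` are illustrative only.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

enum class Sentiment { Negative, Neutral, Positive };

using WordSet = std::unordered_set<std::string>;

// Assumed scoring rule: +1 per positive word, -1 per negative word, no weights.
int scoreTweet(const std::vector<std::string>& words,
               const WordSet& positive, const WordSet& negative) {
    int score = 0;
    for (const auto& w : words) {
        if (positive.count(w)) ++score;
        else if (negative.count(w)) --score;
    }
    return score;
}

// A tweet is neutral when it has no hits or when the hits cancel out.
Sentiment classify(int score) {
    if (score > 0) return Sentiment::Positive;
    if (score < 0) return Sentiment::Negative;
    return Sentiment::Neutral;
}

struct Tally { int positive = 0; int neutral = 0; int negative = 0; };

// Counts how many tweets fall into each class.
Tally tallyTweets(const std::vector<std::vector<std::string>>& tweets,
                  const WordSet& positive, const WordSet& negative) {
    Tally t;
    for (const auto& words : tweets) {
        switch (classify(scoreTweet(words, positive, negative))) {
            case Sentiment::Positive: ++t.positive; break;
            case Sentiment::Negative: ++t.negative; break;
            case Sentiment::Neutral:  ++t.neutral;  break;
        }
    }
    return t;
}
```

With this rule, a tweet counts as neutral both when none of its words are in either dictionary and when positive and negative words cancel each other out.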
I am thinking of implementing the following two things:
- remove stop words from tweets
- remove special characters and emoticons from tweets (basically non-English Unicode)
However, even with this, most of the searches end up being very neutral. For example, if I search for "Apple" in 100 tweets I get, say, 30 positive, 10 negative and 60 neutral.
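
A rough sketch of what these two planned steps could look like, assuming the tweets are UTF-8 encoded and that "non-English Unicode" means anything outside 7-bit ASCII. The function names and the stop-word set are placeholders, not an existing library.

```cpp
#include <string>
#include <unordered_set>
#include <vector>

// Drops every byte outside 7-bit ASCII. In UTF-8, all bytes of a non-ASCII
// character are >= 0x80, so emoji and accented letters are removed whole.
std::string stripNonAscii(const std::string& text) {
    std::string out;
    for (unsigned char c : text)
        if (c < 0x80) out += static_cast<char>(c);
    return out;
}

// Returns the words that are not in the (caller-supplied) stop-word set.
std::vector<std::string> removeStopWords(
        const std::vector<std::string>& words,
        const std::unordered_set<std::string>& stopWords) {
    std::vector<std::string> kept;
    for (const auto& w : words)
        if (stopWords.count(w) == 0) kept.push_back(w);
    return kept;
}
```

For a plain dictionary count, removing a stop word only changes a tweet's total if that word also appears in the positive or negative dictionary.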
Questions:
1. Is there any way to lower the number of neutral tweets?
2. What kind of positive and negative words should I add to represent my search criteria (companies)?