
I am working on my first sentiment analysis project, which will use tweets about sports as input. I am currently preprocessing the data and trying to assign a polarity to each tweet. The many different ways of assigning sentiment scores are confusing me a bit, so I have some questions:

  1. This thread (Training data for sentiment analysis) lists some corpora, but none of them covers sports. Can I use one of them to train a classifier for my case, or would an out-of-domain corpus skew the results?

  2. Is it possible to achieve good results for this topic by relying on a lexicon (e.g. one from the link above)?

  3. Should I query my db and manually annotate tweets in order to train a classifier?

Thanks

PPK

1 Answer

  1. In general, sentiment analysis is affected by using a generic corpus, because some domains have lingo that the corpus will not account for. However, it might not significantly affect your results, because words like "bad" or "great" are strongly polarized regardless of the domain.

  2. Yes, but if you were implementing a product, you'd want to create or find a corpus tailored to the language of your target domain, and check that its results are not statistically significantly different from those of the generic one.

  3. No? If you find a corpus with weights attached to words, you can train a classifier on that. Otherwise, you would have to determine the weights yourself.
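
To make points 1 and 2 concrete, here is a minimal sketch of lexicon-based scoring in Python. Everything in it is an assumption for illustration: the words, the weights, and the helper names `tokenize` and `polarity` are made up and do not come from any real lexicon or library. It only shows how a generic lexicon can miss domain lingo and how a few domain-specific entries change the score.

```python
import re

# Hypothetical generic lexicon: word -> polarity weight in [-1, 1].
# These weights are invented for illustration only.
GENERIC_LEXICON = {"great": 1.0, "good": 0.5, "bad": -0.5, "terrible": -1.0}

# Hypothetical sports-specific entries that a generic lexicon might lack.
SPORTS_LEXICON = {"clutch": 0.8, "choked": -0.8, "benched": -0.4}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def polarity(text, lexicon):
    """Average the weights of all tokens found in the lexicon.

    Returns 0.0 (neutral) when no token is covered by the lexicon.
    """
    scores = [lexicon[t] for t in tokenize(text) if t in lexicon]
    return sum(scores) / len(scores) if scores else 0.0


# Merge the two lexicons; domain entries override generic ones on conflict.
combined = {**GENERIC_LEXICON, **SPORTS_LEXICON}

tweet = "He choked in the final minute"
print(polarity(tweet, GENERIC_LEXICON))  # generic lexicon has no matching word
print(polarity(tweet, combined))         # domain entry for "choked" is picked up
```

With the generic lexicon alone the tweet comes out neutral, because none of its words are covered; with the sports entries added it comes out negative. Running both versions over a sample of your own tweets is one rough way to see how much the domain vocabulary matters before deciding whether to annotate data by hand.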

Tony