1

I need a reliable and accurate method to classify tweets as subjective or objective. In other words, I need to build a filter in something like Weka using a training set.

Are there any training sets available that could be used to train a subjective/objective classifier for Twitter messages, or for other domains where the results may be transferable?

Iterator
NightWolf
  • Since there's no such thing as an objective definition of an objective tweet, you're not going to find a pre-existing training set. – bmargulies Aug 01 '11 at 15:41
  • There are subjective and objective messages posted on Twitter. While the training set may not be perfect for all messages, something that is 75%+ accurate is enough. I don't think you understand the goal here. For example, you may have positive, negative and neutral tweets. I want to determine which tweets are positive/negative and which are neutral. – NightWolf Aug 02 '11 at 12:58
  • Just a short comment: objective != neutral. A good example is "A dolphin is a fish." It is a neutral yet subjective opinion about dolphins. – Skarab Aug 02 '11 at 18:38

3 Answers

2

For research and non-profit purposes, SentiWordNet gives you exactly what you want. A commercial license is available too.

SentiWordNet: http://sentiwordnet.isti.cnr.it/

Sample Java code: http://sentiwordnet.isti.cnr.it/code/SWN3.java

Related paper: http://nmis.isti.cnr.it/sebastiani/Publications/LREC10.pdf


The other approach I would try:

Example

Tweet 1: @xyz u should see the dark knight. Its awesme.

1) First, do a dictionary lookup for the meaning of each word.

"u" and "awesme" will not return anything.

2) Then check the unmatched words against known abbreviations and shorthand, and substitute any matches with their expansions (some resources: NetLingo http://www.netlingo.com/acronyms.php or smsdictionary http://www.smsdictionary.co.uk/abbreviations)

Now the original tweet will look like:

Tweet 1: @xyz you should see the dark knight. Its awesme.

3) Then feed the remaining unmatched words into a spell checker and substitute each with the best match (not always ideal, and error-prone for short words)

Related Link: Looking for Java spell checker library

Now the original tweet will look like:

Tweet 1: @xyz you should see the dark knight. Its awesome.

4) Split the tweet into words, feed them into SWN3, and aggregate the results.

The problems with this approach are:

a) Negations should be handled outside SWN3.

b) Information in emoticons and exaggerated punctuation will be lost unless it is handled separately.
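
Steps 1 to 4 can be sketched roughly as follows. This is only a minimal illustration, not SWN3 itself: `ABBREVIATIONS`, `VOCABULARY` and `LEXICON` are tiny hypothetical tables with made-up scores standing in for the shorthand lists, a real dictionary and SentiWordNet, and Python's `difflib` stands in for a proper spell checker. The `threshold` value is also an assumption.

```python
import difflib
import re

# Hypothetical shorthand table; a real one would come from the resources in step 2.
ABBREVIATIONS = {"u": "you", "gr8": "great", "2moro": "tomorrow"}

# Toy stand-in for SentiWordNet: word -> (positive, negative). Scores are made up.
LEXICON = {
    "awesome": (0.875, 0.0),
    "terrible": (0.0, 0.75),
    "dark": (0.0, 0.25),
}

# Toy dictionary used for the lookup in step 1 and the spell check in step 3.
VOCABULARY = sorted(set(LEXICON) | {"you", "should", "see", "the", "knight", "its", "film"})


def normalize(tweet):
    """Steps 1-3: look up each word, expand shorthand, spell-correct the rest."""
    tokens = []
    for raw in tweet.lower().split():
        if raw.startswith("@"):
            continue  # drop @mentions
        word = re.sub(r"[^a-z0-9]", "", raw)  # strips punctuation (and emoticons)
        if not word:
            continue
        word = ABBREVIATIONS.get(word, word)
        if word not in VOCABULARY:
            # Take the closest dictionary word, if any is similar enough.
            matches = difflib.get_close_matches(word, VOCABULARY, n=1, cutoff=0.8)
            if matches:
                word = matches[0]
        tokens.append(word)
    return tokens


def score(tokens):
    """Step 4: aggregate positive and negative scores over all tokens."""
    pos = sum(LEXICON.get(t, (0.0, 0.0))[0] for t in tokens)
    neg = sum(LEXICON.get(t, (0.0, 0.0))[1] for t in tokens)
    return pos, neg


def is_subjective(tokens, threshold=0.5):
    """Call a tweet subjective when its total sentiment mass reaches the threshold."""
    pos, neg = score(tokens)
    return pos + neg >= threshold


tokens = normalize("@xyz u should see the dark knight. Its awesme.")
print(tokens)
print(score(tokens), is_subjective(tokens))
```

The sketch shares both weaknesses listed above: it has no negation handling, and the punctuation stripping throws emoticons away.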

Community
Neodawn
  • Thanks. I have looked at SentiWordNet, which seems good. However, the problem here is that Twitter messages tend to be misspelt, abbreviated, etc., so I was thinking it may not be the best approach. Do you know of any Java code which implements word sense disambiguation with SWN3? – NightWolf Aug 02 '11 at 17:29
  • Sorry, I am more of a Python guy… :) Some related papers that could be of help: http://www.hpl.hp.com/techreports/2011/HPL-2011-89.pdf http://www.stanford.edu/~richab86/CS224N.Go.Bhayani.pdf – Neodawn Aug 02 '11 at 19:58
2

There is sentiment training data at CMU somewhere, but I can't remember the link. CMU has done a lot of work on Twitter and sentiment analysis.

I wrote an English vs. non-English Naive Bayes classifier for Twitter and made a ~example dev/test set, and it was 98% accurate. I think that sort of thing is always pretty good if you are just trying to understand the problem, but a package like SentiWordNet might give you a head start.
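
For a sense of scale, a Naive Bayes text classifier of this kind fits in a few dozen lines. The following is a minimal sketch, not the original code (which is not available): it assumes word-unigram features with add-one smoothing, and the six training sentences and the label names are made up purely for illustration.

```python
import math
from collections import Counter, defaultdict


class NaiveBayes:
    """Multinomial Naive Bayes over lowercase word tokens, add-one smoothing."""

    def __init__(self):
        self.word_counts = defaultdict(Counter)  # label -> word -> count
        self.doc_counts = Counter()              # label -> number of training docs
        self.vocabulary = set()

    def train(self, text, label):
        words = text.lower().split()
        self.word_counts[label].update(words)
        self.doc_counts[label] += 1
        self.vocabulary.update(words)

    def predict(self, text):
        total_docs = sum(self.doc_counts.values())
        best_label, best_score = None, -math.inf
        for label, n_docs in self.doc_counts.items():
            counts = self.word_counts[label]
            denom = sum(counts.values()) + len(self.vocabulary)
            # log prior + sum of smoothed log likelihoods
            log_prob = math.log(n_docs / total_docs)
            for word in text.lower().split():
                log_prob += math.log((counts[word] + 1) / denom)
            if log_prob > best_score:
                best_label, best_score = label, log_prob
        return best_label


# Made-up training examples; real use needs many hand-labeled tweets.
TRAINING = [
    ("you should see the dark knight", "english"),
    ("this is the best movie of the year", "english"),
    ("what are you doing this weekend", "english"),
    ("je ne sais pas quoi faire ce soir", "other"),
    ("ich habe heute keine zeit", "other"),
    ("hoy vamos a la playa con amigos", "other"),
]

classifier = NaiveBayes()
for text, label in TRAINING:
    classifier.train(text, label)

print(classifier.predict("the best movie this year"))
print(classifier.predict("je ne sais pas"))
```

The same class could be trained on subjective/objective labels instead of language labels; only the training data changes.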

The problem is defining what makes a tweet subjective or objective! It's important to understand that machine learning is less about the algorithm and more about the quality of the data.

You mention 75% accuracy is all you need. What about recall? If you provide the right training data you might be able to get that, at the cost of lower recall.
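
To make the trade-off concrete, here is a small sketch that computes accuracy, precision and recall for the "subjective" class. The ten gold labels and predictions are invented for illustration: the classifier is very cautious, so it is never wrong when it says "subjective" but misses most subjective tweets.

```python
def evaluate(gold, predicted, positive="S"):
    """Return (accuracy, precision, recall) with respect to the positive label."""
    pairs = list(zip(gold, predicted))
    tp = sum(1 for g, p in pairs if g == positive and p == positive)
    fp = sum(1 for g, p in pairs if g != positive and p == positive)
    fn = sum(1 for g, p in pairs if g == positive and p != positive)
    correct = sum(1 for g, p in pairs if g == p)

    accuracy = correct / len(pairs)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return accuracy, precision, recall


# Hypothetical labels: S = subjective, N = neutral/objective.
gold      = ["S", "S", "S", "S", "N", "N", "N", "N", "N", "N"]
predicted = ["S", "N", "N", "N", "N", "N", "N", "N", "N", "N"]

accuracy, precision, recall = evaluate(gold, predicted)
print(accuracy, precision, recall)  # 0.7 1.0 0.25
```

Here accuracy is 70% and precision is 100%, yet only one in four subjective tweets is found, which is why a single accuracy figure is not enough to judge the filter.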

nflacco
  • Your English vs. non-English Naive Bayes classifier for Twitter sounds interesting. Any chance this one is on GitHub? – NightWolf Aug 03 '11 at 04:49
  • Unfortunately not, it's on my old desktop box 2000 miles away! The code wasn't complicated at all though, the thing that took the time was labeling all the data. – nflacco Aug 03 '11 at 05:43
1

The DynamicLMClassifier in LingPipe works pretty well.

http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

y2p