57

Where can I get a corpus of documents that have already been classified as positive/negative for sentiment in the corporate domain? I want a large corpus of documents that provide reviews for companies, like reviews of companies provided by analysts and media.

I can find corpora that have reviews of products and movies. Is there a corpus for the business domain, including reviews of companies, that matches the language of business?

Iterator
London guy
See also this related question: http://stackoverflow.com/questions/5570681/what-training-data-sources-could-be-used-for-sentiment-classification-models – John Lehmann Sep 27 '11 at 14:33

6 Answers

37

http://www.cs.cornell.edu/home/llee/data/

http://mpqa.cs.pitt.edu/corpora/mpqa_corpus

You can use Twitter, treating its smileys as sentiment labels, as described here: http://web.archive.org/web/20111119181304/http://deepthoughtinc.com/wp-content/uploads/2011/01/Twitter-as-a-Corpus-for-Sentiment-Analysis-and-Opinion-Mining.pdf

Hope that gets you started. There's more in the literature, if you're interested in specific subtasks like negation, sentiment scope, etc.

To focus on companies, you might pair a sentiment method with topic detection or, more cheaply, just collect a lot of mentions of a given company. Or you could have your data annotated by Mechanical Turk workers.
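A minimal sketch of the two cheap ideas above, combined: label short texts by the emoticons they contain, then keep only those that mention a given company. The emoticon sets, the function names, and the company name "Acme" are illustrative assumptions, not taken from the linked paper.

```python
import re

# Illustrative emoticon sets (an assumption; extend them for real data).
POSITIVE = {":)", ":-)", ":D", "=)"}
NEGATIVE = {":(", ":-(", ":'("}


def emoticon_label(text):
    """Return 'pos', 'neg', or None when emoticons are absent or mixed.

    Only whitespace-separated emoticons are detected, so "great:)" is missed.
    """
    tokens = text.split()
    has_pos = any(t in POSITIVE for t in tokens)
    has_neg = any(t in NEGATIVE for t in tokens)
    if has_pos == has_neg:  # neither present, or both present
        return None
    return "pos" if has_pos else "neg"


def mentions(text, company):
    """Case-insensitive whole-word match on the company name."""
    pattern = r"\b" + re.escape(company) + r"\b"
    return re.search(pattern, text, re.IGNORECASE) is not None


def build_corpus(texts, company):
    """Collect (text_without_emoticons, label) pairs that mention the company."""
    corpus = []
    for text in texts:
        label = emoticon_label(text)
        if label is None or not mentions(text, company):
            continue
        # Drop the emoticons so a classifier cannot simply memorise them.
        cleaned = " ".join(
            t for t in text.split() if t not in POSITIVE and t not in NEGATIVE
        )
        corpus.append((cleaned, label))
    return corpus


if __name__ == "__main__":
    sample = [
        "Acme earnings beat estimates :)",
        "acme is laying off staff :(",
        "Lovely weather today :)",
    ]
    for text, label in build_corpus(sample, "Acme"):
        print(label, text)
```

Labels obtained this way are noisy, so it is worth hand-checking a sample before training on them.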

Gregory Marton
25

This is a list I wrote a few weeks ago, from my blog. Some of these datasets have been recently included in the NLTK Python platform.

Lexicons


Datasets


References:

Kurt Bourbaki
12

Here are a few more:

http://inclass.kaggle.com/c/si650winter11

http://alias-i.com/lingpipe/demos/tutorial/sentiment/read-me.html

y2p
4

If you have some resources (media channels, blogs, etc.) about the domain you want to explore, you can create your own corpus. I do this in Python:

  • Use Beautiful Soup (http://www.crummy.com/software/BeautifulSoup/) to parse the content that I want to classify.
  • Separate out the sentences that express positive or negative opinions about companies.
  • Use NLTK to process these sentences: tokenize words, POS tagging, etc.
  • Use NLTK's PMI measure to find the bigrams or trigrams that are most frequent in only one class.
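The last step above can be sketched as follows. To stay runnable without extra installs, this sketch computes PMI by hand with the standard library instead of going through NLTK, and it uses a simple regex tokenizer rather than NLTK's. The function names and sample sentences are made up for illustration, and scoring PMI between a bigram and a class label is one reading of the step, not necessarily the author's exact method.

```python
import math
import re
from collections import Counter


def bigrams(text):
    """Lower-case the text, split on non-letters, and pair adjacent words."""
    words = re.findall(r"[a-z']+", text.lower())
    return list(zip(words, words[1:]))


def class_pmi(labeled_sentences, min_count=2):
    """Score (bigram, label) pairs by PMI = log2(P(b, c) / (P(b) * P(c))).

    Probabilities are estimated from bigram token counts. Bigrams seen
    fewer than min_count times overall are skipped as too rare to trust.
    """
    joint, by_bigram, by_label = Counter(), Counter(), Counter()
    for text, label in labeled_sentences:
        for bg in bigrams(text):
            joint[(bg, label)] += 1
            by_bigram[bg] += 1
            by_label[label] += 1
    total = sum(by_label.values())
    scores = {}
    for (bg, label), n in joint.items():
        if by_bigram[bg] >= min_count:
            scores[(bg, label)] = math.log2(
                n * total / (by_bigram[bg] * by_label[label])
            )
    return scores


def one_class_bigrams(labeled_sentences, label, min_count=2):
    """Bigrams that pass min_count and occur only under the given label."""
    scores = class_pmi(labeled_sentences, min_count)
    labels_seen = {}
    for bg, lab in scores:
        labels_seen.setdefault(bg, set()).add(lab)
    hits = [bg for bg, labs in labels_seen.items() if labs == {label}]
    return sorted(hits, key=lambda bg: scores[(bg, label)], reverse=True)


if __name__ == "__main__":
    sentences = [
        ("Acme posted strong growth this quarter", "pos"),
        ("Analysts praised the strong growth at Acme", "pos"),
        ("Acme issued a profit warning", "neg"),
        ("Another profit warning hit the shares", "neg"),
    ]
    print(one_class_bigrams(sentences, "pos"))
    print(one_class_bigrams(sentences, "neg"))
```

With NLTK installed, its collocation finders and association measures cover the bigram counting and scoring; the hand-rolled version here is only meant to make the idea concrete.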

Creating a corpus is hard work (pre-processing, checking, tagging, etc.), but the benefit is a model prepared for a specific domain, which often increases accuracy. If you can get an already prepared corpus, just go ahead with the sentiment analysis ;)

Luchux
1

I'm not aware of any such corpus being freely available, but you could try an unsupervised method on an unlabeled dataset.

Community
Fred Foo
0

You can get a large selection of online reviews from Datafiniti. Most of the reviews come with rating data, which provides more granularity on sentiment than positive/negative. Here's a list of businesses with reviews, and here's a list of products with reviews.

shiondev