What are the best lexicons for document-level and sentence-level sentiment analysis? I'm currently using VADER for sentence-level analysis, but I'm worried that when I move to the document level, it may not perform as well as other lexicons.

This is similar to the question in the post here, but more specific.

berkin
Laurie

1 Answer

In addition to the sentiment lexica listed in the linked post, I can recommend the AFINN sentiment lexicon.

For sentiment analysis, relying only on lexica may not be the best solution, especially at the document level. Language is so flexible that attributes and notions other than sentiment-laden vocabulary affect semantics deeply.

Some of the core notions are contrastive discourse markers (especially at the document level), negation, and modality.

  • contrastive discourse markers

Some documents contain opinions with both pros and cons, and we tie these together with markers such as 'however' and 'nevertheless' to convey a meaning or an idea. A bag-of-words approach treats the two sentences below identically, yet if people were asked to annotate each with a single sentiment label, they might not choose the same one:

The laptop has amazing features, but its screen is killing me.
The laptop's screen is killing me, but it has amazing features.

In general, we evaluate these kinds of sentences or paragraphs by the sentiment of the clause after 'but'. Other contrastive discourse markers have their own semantics as well. This is studied in an area called discourse analysis.
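
As a rough sketch of the point above: the snippet below scores the two laptop sentences with a plain bag-of-words sum and then with a simple "keep only what follows 'but'" heuristic. The two-word lexicon and its scores are made up for illustration (they are not taken from VADER or AFINN), and the function names are hypothetical.

```python
import re

# Toy lexicon for illustration only; the scores are invented,
# not taken from VADER, AFINN, or any published resource.
TOY_LEXICON = {"amazing": 3.0, "killing": -3.0}


def tokenize(text):
    """Lowercase the text and split it into simple word tokens."""
    return re.findall(r"[a-z']+", text.lower())


def bag_of_words_score(text):
    """Sum lexicon scores over all tokens, ignoring word order."""
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in tokenize(text))


def but_aware_score(text, marker="but"):
    """Score only the tokens after the last occurrence of the marker.

    If the marker is absent, this falls back to the bag-of-words sum.
    """
    tokens = tokenize(text)
    if marker in tokens:
        last = len(tokens) - 1 - tokens[::-1].index(marker)
        tokens = tokens[last + 1:]
    return sum(TOY_LEXICON.get(tok, 0.0) for tok in tokens)


a = "The laptop has amazing features, but its screen is killing me."
b = "The laptop's screen is killing me, but it has amazing features."

print(bag_of_words_score(a), bag_of_words_score(b))  # 0.0 0.0
print(but_aware_score(a), but_aware_score(b))        # -3.0 3.0
```

Dropping everything before 'but' is the crudest possible choice; down-weighting the first clause instead of discarding it would be a gentler variant of the same idea.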

  • negation and modality

These notions change semantics as well, so they cannot be overlooked at either level. There are studies and papers that use negation and modality triggers together with sentiment lexica. You can search for 'negation and modality on sentiment analysis' to see what you can do.
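
To make the idea of triggers concrete, here is a minimal sketch that combines a lexicon with negation and modality trigger lists. Everything in it is an assumption for illustration: the toy lexicon, the trigger words, the three-token window, and the choice to flip the sign for negation and halve the score for modality. Published approaches differ in all of these details.

```python
import re

# All of the following are illustrative assumptions, not a published resource.
TOY_LEXICON = {"good": 2.0, "bad": -2.0}
NEGATION_TRIGGERS = {"not", "never", "no"}
MODALITY_TRIGGERS = {"might", "may", "could", "would"}


def score_with_triggers(text, window=3):
    """Sum lexicon scores, adjusting each hit by triggers that precede it.

    A negation trigger within `window` tokens before a sentiment word
    flips its sign; a modality trigger in the same window halves it.
    """
    tokens = re.findall(r"[a-z']+", text.lower())
    total = 0.0
    for i, tok in enumerate(tokens):
        if tok not in TOY_LEXICON:
            continue
        value = TOY_LEXICON[tok]
        context = tokens[max(0, i - window):i]
        if any(t in NEGATION_TRIGGERS for t in context):
            value = -value
        if any(t in MODALITY_TRIGGERS for t in context):
            value *= 0.5
        total += value
    return total


print(score_with_triggers("The results are good"))            # 2.0
print(score_with_triggers("The results are not good"))        # -2.0
print(score_with_triggers("The results might be good"))       # 1.0
print(score_with_triggers("The results might not be good"))   # -1.0
```

A fixed window is a blunt way to approximate the scope of a trigger; it will misfire on longer sentences, which is one reason the literature on negation and modality scope is worth a look.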

Finally, if you have a domain-specific dataset, I suggest building your own lexicon using distant supervision.
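
One common way to do this, sketched here under assumptions, is to take texts that carry a noisy sentiment label from some cheap signal (for tweets, this is typically a hashtag or an emoticon) and score each word by how much more strongly it is associated with the positive texts than the negative ones. The score below is the difference of the two pointwise mutual information values, which reduces to a log ratio of smoothed relative frequencies. The function name, the add-one smoothing, and the toy finance-flavoured data are all hypothetical.

```python
import math
from collections import Counter


def build_lexicon(labeled_docs, smoothing=1.0):
    """Build a word -> score lexicon from noisily labeled texts.

    `labeled_docs` is an iterable of (text, label) pairs, where label is
    "pos" or "neg". The score is PMI(word, pos) - PMI(word, neg), which
    simplifies to log2 of the ratio of the word's smoothed relative
    frequency in positive texts to that in negative texts.
    """
    pos_counts, neg_counts = Counter(), Counter()
    for text, label in labeled_docs:
        tokens = text.lower().split()
        if label == "pos":
            pos_counts.update(tokens)
        else:
            neg_counts.update(tokens)

    vocab = set(pos_counts) | set(neg_counts)
    # Add-k smoothing so that words seen in only one class get a finite score.
    total_pos = sum(pos_counts.values()) + smoothing * len(vocab)
    total_neg = sum(neg_counts.values()) + smoothing * len(vocab)

    lexicon = {}
    for word in vocab:
        p_pos = (pos_counts[word] + smoothing) / total_pos
        p_neg = (neg_counts[word] + smoothing) / total_neg
        lexicon[word] = math.log2(p_pos / p_neg)
    return lexicon


# Toy data; in practice the labels would come from a noisy signal, not by hand.
docs = [
    ("profit growth strong", "pos"),
    ("strong profit this quarter", "pos"),
    ("loss decline weak", "neg"),
    ("weak loss this quarter", "neg"),
]
lexicon = build_lexicon(docs)
for word in ("profit", "loss", "quarter"):
    print(word, round(lexicon[word], 3))
```

Words that appear equally often on both sides, such as "quarter" here, end up with a score of zero, so the lexicon concentrates on vocabulary that actually separates the two classes.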

Hope this helps,

Cheers

berkin
  • Hi mate, thanks for the comprehensive response. The reason I'm using lexica is that my problem is unsupervised, and the domain-specific data the tool is intended to be used on is not currently available. The specific domain will be finance-based and will contain sentiment-laden commentary regarding appointed staff tasks and their progress. VADER accounts for negation terms, so that shouldn't be too much of a problem, and it weights the last segment of the sentence higher than the first segment. I have a lexicon of positive and negative financial terms, and I'm currently adding those to VADER. – Laurie Jul 26 '18 at 08:51
  • Thanks for suggesting AFINN. Do you know of any others that could be useful? – Laurie Jul 26 '18 at 08:51
  • You are welcome. There is another lexicon, built from tweets using distant supervision, called the NRC Sentiment Lexicon. However, it may not suit your domain or the language used in your data. I extended their approach and created my own lexicon, which is not available online right now. But I give details of how I built it in my thesis, if you are interested: https://spectrum.library.concordia.ca/980377/1/Ozdemir_MCompSc_F2015.pdf I mention other sentiment analysis resources there as well. Hope it helps. – berkin Jul 26 '18 at 09:10