
I'd like to apply qdap's polarity function to a vector of documents, each of which could contain multiple sentences, and obtain the corresponding polarity for each document. For example:

library(qdap)
polarity(DATA$state)$all$polarity
# Results:
 [1] -0.8165 -0.4082  0.0000 -0.8944  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000
Warning message:
In polarity(DATA$state) :
  Some rows contain double punctuation.  Suggested use of `sentSplit` function.

This warning can't simply be ignored, as polarity appears to sum the polarity scores of each sentence in the document, which can result in document-level polarity scores outside the [-1, 1] bounds.

I'm aware of the option to first run sentSplit and then average across the sentences, perhaps weighting polarity by word count, but this (1) is inefficient (it takes roughly 4x as long as running on the full documents and accepting the warning), and (2) leaves it unclear how the sentences should be weighted. That approach would look something like this:

DATA$id <- seq(nrow(DATA)) # For identifying and aggregating documents 
sentences <- sentSplit(DATA, "state")
library(data.table) # For aggregation
pol.dt <- data.table(polarity(sentences$state)$all)
pol.dt[, id := sentences$id]
document.polarity <- pol.dt[, sum(polarity * wc) / sum(wc), by = "id"] # Word-count-weighted mean per document

I was hoping I could run polarity on a version of the vector with periods removed, but it seems that sentSplit does more than just split on periods. Removing periods works on DATA but not on other sets of text (I'm unsure of the full set of sentence breaks besides periods).
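A rough sketch of that attempt (using gsub here; the pattern handles only periods, which is presumably why it breaks on other text):

polarity(gsub("\\.", "", DATA$state))$all$polarity
# No warning on DATA, but documents with two or more remaining ?'s or !'s would still trigger it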

So, I suspect the best way of approaching this is to make each element of the document vector look like one long sentence. How would I do this, or is there another way?

Max Ghenis
  • Removing the endmarks is extra work if you just want to ignore the warnings. Your results are the same, so it seems you just don't want a warning. First I'd say if it's interactive then you could just ignore the warning, as it's just a flag saying this could be bad. If you really want to suppress it then use `suppressWarnings` and forgo the trick, as stripping the punctuation just takes extra time. Also note that `polarity`'s algorithm is no longer bounded at -1 and 1. – Tyler Rinker Apr 01 '14 at 13:27
  • @TylerRinker if the algorithm truly works the same (i.e. considers the string as a single long sentence) then I have no problem ignoring the warning, but the results differ (see answer). – Max Ghenis Apr 01 '14 at 18:07
  • Max, this put me on to a bugglet in the code. It has to do with the number of words and comma handling. Thanks for the find. I credited you with the find: https://github.com/trinker/qdap/blob/master/NEWS.md – Tyler Rinker Apr 02 '14 at 01:12
  • Awesome, thanks Tyler! So then in the prior version, would you say the values in my question or answer are more correct? Also, any estimate when you'll push the change to CRAN? – Max Ghenis Apr 02 '14 at 01:27
  • Max, I just pushed to CRAN today, so it should be up in about a week. You can use the [GitHub](https://github.com/trinker/qdap) version if it's dire for now. Both approaches are different from splitting by sentence, but if you have to choose one with the prior version, the question version is better as it doesn't ignore how commas function. – Tyler Rinker Apr 02 '14 at 01:40
  • OK, could you explain how commas and periods affect the polarity score (assuming I went off your fixed version)? In general I'm looking for the most appropriate way to estimate polarity of a full document. I'm guessing punctuation breaks up amplifiers, so "very, good" and "very. Good" are both less positive than "very good", is that right? – Max Ghenis Apr 02 '14 at 02:05
  • Your explanation is correct. `a <- c("very, good", "very. Good", "very good"); polarity(a, id(a))` if you had split on the period in element 2. Otherwise the period is stripped and ignored. Anything before a comma within the polarized context cluster is ignored. `round(scores(polarity(a, id(a)))[, 4], 3); #0.707 1.273 1.273` – Tyler Rinker Apr 02 '14 at 02:32
  • Got it. So to determine document polarity, it sounds like replacing sentence breaks with commas would actually be best, as amplifiers are correctly split up. Would you agree? Do any other characters split amplifiers? Or if you disagree and recommend sentSplit first, what would be the best weighting? – Max Ghenis Apr 02 '14 at 03:50
  • Yes, you got it. No other characters split amplifiers, as all other punctuation is stripped out almost immediately after the warning is thrown. – Tyler Rinker Apr 02 '14 at 04:41
  • New version pushed to CRAN 4/8/14 – Tyler Rinker Apr 08 '14 at 15:44
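A minimal sketch of the comma-replacement approach the comments converge on (the gsub pattern and the choice of end marks are assumptions on my part, not something qdap prescribes):

library(qdap)
# Replace sentence end marks with commas so each document reads as a single "sentence"
one.sentence <- gsub("[.?!]+", ",", DATA$state)
counts(polarity(one.sentence))[, "polarity"] # No double-punctuation warning; commas still split amplifiers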

2 Answers


Max found a bug in this version of qdap (1.3.4) that counted a placeholder as a word, which affected the equation since the denominator is sqrt(n), where n is the word count. As of 1.3.5 this has been corrected; the bug is why the two outputs (the question's and the one below) did not match.
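To illustrate with the second document ("No it's not, it's dumb."), whose context-cluster sum appears to be -1, the only difference between the versions is the denominator:

-1 / sqrt(6)  # -0.4082: 1.3.4, with the comma placeholder counted as a sixth word (the question's output)
-1 / sqrt(5)  # -0.4472: 1.3.5, with the placeholder no longer counted (the output below)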

Here is the output:

library(qdap)
counts(polarity(DATA$state))[, "polarity"]

## > counts(polarity(DATA$state))[, "polarity"]
##  [1] -0.8164966 -0.4472136  0.0000000 -1.0000000  0.0000000  0.0000000  0.0000000
##  [8] -0.5773503  0.0000000  0.4082483  0.0000000
## Warning message:
## In polarity(DATA$state) : 
##   Some rows contain double punctuation.  Suggested use of `sentSplit` function.

In this case using strip does not matter. It may matter in more complex situations involving amplifiers, negators, negatives, and commas. Here is an example:

## > counts(polarity("Really, I hate it"))[, "polarity"]
## [1] -0.5
## > counts(polarity(strip("Really, I hate it")))[, "polarity"]
## [1] -0.9

See the documentation for more.

Tyler Rinker

It looks like removing punctuation and other extras tricks polarity into treating each document as a single sentence:

library(tm) # removePunctuation, removeNumbers, stripWhitespace come from tm
SimplifyText <- function(x) {
  # Lowercase, collapse whitespace, then drop numbers and punctuation
  return(removePunctuation(removeNumbers(stripWhitespace(tolower(x)))))
}
polarity(SimplifyText(DATA$state))$all$polarity
# Result (no warning)
 [1] -0.8165 -0.4472  0.0000 -1.0000  0.0000  0.0000  0.0000 -0.5774  0.0000
[10]  0.4082  0.0000 
Max Ghenis