
I have a data.frame with a week-number column, week, and a text-review column, text. I would like to treat the week variable as my grouping variable and run some basic text analysis on it (e.g. qdap::polarity). Some of the reviews contain multiple sentences; however, I only care about each week's polarity "on the whole".

How can I chain together multiple text transformations before running qdap::polarity and adhere to its warning messages? I am able to chain together transformations with the tm::tm_map and tm::tm_reduce -- is there something comparable in qdap? What is the proper way to pre-treat/transform this text prior to running qdap::polarity and/or qdap::sentSplit?

More details in the following code / reproducible example:

library(qdap)
library(tm)

df <- data.frame(week = c(1, 1, 1, 2, 2, 3, 4),
                 text = c("This is some text. It was bad. Not good.",
                          "Another review that was bad!",
                          "Great job, very helpful; more stuff here, but can't quite get it.",
                          "Short, poor, not good Dr. Jay, but just so-so. And some more text here.",
                          "Awesome job! This was a great review. Very helpful and thorough.",
                          "Not so great.",
                          "The 1st time Mr. Smith helped me was not good."),
                 stringsAsFactors = FALSE)

docs <- as.Corpus(df$text, df$week)

funs <- list(stripWhitespace,
             tolower,
             replace_ordinal,
             replace_number,
             replace_abbreviation)

# Is there a qdap function that does something similar to the next line?
# Or is there a way to pass this VCorpus / Corpus directly to qdap::polarity?
docs <- tm_map(docs, FUN = tm_reduce, tmFuns = funs)


# At the end of the day, I would like to get this type of output, but adhere to
# the warning message about running sentSplit. How should I pre-treat / cleanse
# these sentences, but keep the "week" grouping?
pol <- polarity(df$text, df$week)

## Not run:
# check_text(df$text)
JasonAizkalns

1 Answer

You could run sentSplit as suggested in the warning as follows:

df_split <- sentSplit(df, "text")
with(df_split, polarity(text, week))

##   week total.sentences total.words ave.polarity sd.polarity stan.mean.polarity
## 1    1               5          26       -0.138       0.710             -0.195
## 2    2               6          26        0.342       0.402              0.852
## 3    3               1           3       -0.577          NA                 NA
## 4    4               2          10        0.000       0.000                NaN

Note that I have a breakout sentiment package, sentimentr, available on GitHub that is an improvement in speed, functionality, and documentation over the qdap version. It does the sentence splitting internally in the sentiment_by function. The script below installs the package and uses it:

if (!require("pacman")) install.packages("pacman")
p_load_gh("trinker/sentimentr")

with(df, sentiment_by(text, week))

##    week word_count        sd ave_sentiment
## 1:    2         25 0.7562542    0.21086408
## 2:    1         26 1.1291541    0.05781106
## 3:    4         10        NA    0.00000000
## 4:    3          3        NA   -0.57735027
Tyler Rinker
  • Thanks, Tyler, I was hoping this would catch your eye. Real quick (and hopefully to help others too): are `sentSplit` and/or `sentiment_by` doing any internal transformations? I'd still like to do some clean-up transformations before the sentence splitting. How can I apply transformations before (or after) the call to `sentSplit` but before calculating polarity/sentiment? See the list of functions, `funs`, in the question. I don't have time (immediately) to look at sentimentr, so if this is covered in its docs, kindly ignore this or feel free to point me in the right direction. – JasonAizkalns Dec 02 '15 at 15:54
  • Yes, both split the text at the sentence level. **sentimentr** is much more accurate at this task, so you'll get better results without hand-parsing the text. To transform beforehand, just operate on the column as a vector. I'd (most likely) do the cleaning before splitting. – Tyler Rinker Dec 02 '15 at 16:54
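
To make the "operate on the column as a vector" suggestion concrete, here is a minimal base-R sketch. It assumes every cleaning step takes a character vector and returns one of the same length, which is what lets `Reduce` play the role `tm_reduce` plays for a corpus. The helpers `squish_whitespace`, `expand_titles`, and `apply_all` are hypothetical stand-ins written for this illustration, not qdap or tm functions; qdap's `replace_ordinal`, `replace_number`, and `replace_abbreviation` also take a text vector, so they should be usable in the same list, though that is not tested here.

```r
# Hypothetical cleaning steps: each takes a character vector and returns one
# of the same length, so the row-wise `week` grouping is never disturbed.
squish_whitespace <- function(x) trimws(gsub("\\s+", " ", x))

expand_titles <- function(x) {
  # Spell out two titles so their periods are not mistaken for sentence ends
  x <- gsub("\\bDr\\.", "Doctor", x, perl = TRUE)
  gsub("\\bMr\\.", "Mister", x, perl = TRUE)
}

# Steps are applied left to right
funs <- list(squish_whitespace, expand_titles, tolower)

# Hypothetical helper: feed the output of each function into the next one
apply_all <- function(x, funs) Reduce(function(acc, f) f(acc), funs, x)

df <- data.frame(week = c(2, 4),
                 text = c("Short,  poor, not good Dr. Jay.  ",
                          "The 1st time Mr. Smith helped me was not good."),
                 stringsAsFactors = FALSE)

# Clean the column in place; `week` stays aligned with the cleaned text
df$text <- apply_all(df$text, funs)
print(df$text)
```

Because only the `text` column is overwritten, the cleaned data.frame can then go through `sentSplit(df, "text")` and `polarity(text, week)` exactly as in the answer.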