Calculate sentiment of each row in a big dataset using R

Question

I having trouble calculating average sentiment of each row in a relatively big dataset (N=36140). My dataset containts review data from an app on Google Play Store (each row represents one review) and I would like to calculate sentiment of each review using sentiment_by() function. The problem is that this function takes a lot of time to calculate it.

Here is the link to my dataset in .csv format:

https://drive.google.com/drive/folders/1JdMOGeN3AtfiEgXEu0rAP3XIe3Kc369O?usp=sharing

I have tried using this code:

library(sentimentr)
e_data = read.csv("15_06_2016-15_06_2020__Sygic.csv", stringsAsFactors = FALSE)
sentiment=sentiment_by(e_data$review)

Then I get the following warning message (After I cancel the process when 10+ minutes has passed):

Warning message:
Each time `sentiment_by` is run it has to do sentence boundary disambiguation when a
raw `character` vector is passed to `text.var`. This may be costly of time and
memory.  It is highly recommended that the user first runs the raw `character`
vector through the `get_sentences` function.

I have also tried to use the get_sentences() function with the following code, but the sentiment_by() function still needs a lot of time to execute the calculations

e_sentences = e_data$review %>%
  get_sentences() 
e_sentiment = sentiment_by(e_sentences)

I have datasets regarding the Google Play Store review data and I have used the sentiment_by() function for the past month and it worked very quickly when calculating the sentiment... I only started to run calculations for this long since yesterday.

Is there a way to quickly calculate sentiment for each row on a big dataset.

score 1 · Accepted Answer · answered Aug 22 '20 at 10:23

The algorithm used in sentiment appears to be O(N^2) once you get above 500 or so individual reviews, which is why it's suddenly taking a lot longer when you upped the size of the dataset significantly. Presumably it's comparing every pair of reviews in some way?

I glanced through the help file (?sentiment) and it doesn't seem to do anything which depends on pairs of reviews so that's a bit odd.

library(data.table)
reviews <- iconv(e_data$review, "") # I had a problem with UTF-8, you may not need this
x1 <- rbindlist(lapply(reviews[1:10],sentiment_by))
x1[,element_id:=.I]
x2 <- sentiment_by(reviews[1:10])

produce effectively the same output which means that the sentimentr package has a bug in it causing it to be unnecessarily slow.

One solution is just to batch the reviews. This will break the 'by' functionality in sentiment_by, but I think you should be able to group them yourself before you send them in (or after as it doesnt seem to matter).

batch_sentiment_by <- function(reviews, batch_size = 200, ...) {
  review_batches <- split(reviews, ceiling(seq_along(reviews)/batch_size))
  x <- rbindlist(lapply(review_batches, sentiment_by, ...))
  x[, element_id := .I]
  x[]
}

batch_sentiment_by(reviews)

Takes about 45 seconds on my machine (and should be O(N) for bigger datasets.

Worked like a charm. Thank you very much. It took your function less than 60 seconds to calculate sentiment for each app with 30k+ rows. — tantal148, Aug 22 '20 at 14:54

Calculate sentiment of each row in a big dataset using R

1 Answers1