
I am trying to match a list of words with a list of sentences and form a data frame with the matching words and sentences. For example:

words <- c("far better","good","great","sombre","happy")
sentences <- c("This document is far better","This is a great app","The night skies were sombre and starless", "The app is too good and i am happy using it", "This is how it works")

The expected result (a dataframe) is as follows:

sentences                                               words
This document is far better                               far better
This is a great app                                       great
The night skies were sombre and starless                  sombre 
The app is too good and i am happy using it               good, happy
This is how it works                                      -

I am using the following code to achieve this.

lengthOfData <- nrow(sentence_df)
pos.words <- polarity_table[polarity_table$y>0]$x
neg.words <- polarity_table[polarity_table$y<0]$x
positiveWordsList <- list()
negativeWordsList <- list()
for(i in 1:lengthOfData){
        sentence <- sentence_df[i,]$comment
        #sentence <- gsub('[[:punct:]]', "", sentence)
        #sentence <- gsub('[[:cntrl:]]', "", sentence)
        #sentence <- gsub('\\d+', "", sentence)
        sentence <- tolower(sentence)
        # get  unigrams  from the sentence
        unigrams <- unlist(strsplit(sentence, " ", fixed=TRUE))

        # get bigrams from the sentence
        # note: 1:length(unigrams)-1 parses as (1:length(unigrams)) - 1, so the range must be parenthesized
        bigrams <- unlist(lapply(1:(length(unigrams) - 1), function(i) paste(unigrams[i], unigrams[i + 1])))

        # .. and combine into data frame
        words <- c(unigrams, bigrams)
        #if(sentence_df[i,]$ave_sentiment)

        pos.matches <- match(words, pos.words)
        neg.matches <- match(words, neg.words)
        pos.matches <- na.omit(pos.matches)
        neg.matches <- na.omit(neg.matches)
        positiveList <- pos.words[pos.matches]
        negativeList <- neg.words[neg.matches]

        if(length(positiveList)==0){
          positiveList <- c("-")
        }
        if(length(negativeList)==0){
          negativeList <- c("-")
        }
        negativeWordsList[[i]] <- paste(unique(negativeList), collapse = ", ")
        positiveWordsList[[i]] <- paste(unique(positiveList), collapse = ", ")

    }    
positiveWordsList <- as.vector(unlist(positiveWordsList))
negativeWordsList <- as.vector(unlist(negativeWordsList))
scores.df <- data.frame(ave_sentiment=sentence_df$ave_sentiment, comment=sentence_df$comment,pos=positiveWordsList,neg=negativeWordsList, year=sentence_df$year,month=sentence_df$month,stringsAsFactors = FALSE)

I have 28k sentences and 65k words to match against. The above code takes 45 seconds to accomplish the task. Any suggestions on how to improve its performance, as the current approach takes a lot of time?

Edit:

I want to get only those words which exactly matches with the words in the sentences. For example :

words <- c('sin','vice','crashes') 
sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP')

Now for the above case my output should be as follows:

sentences                                                           words
Since the app crashes frequently, I advice you guys to fix        crashes
the issue ASAP  
Venu
    You may do it in parallel. – Szymon Roziewski Sep 12 '16 at 12:18
    Is that any better? `library(stringi) ; sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))` – David Arenburg Sep 12 '16 at 12:19
  • @David I used this solution and it reduced the computation time, but I need the output as a data frame. Can you tell me how to achieve that? – Venu Sep 12 '16 at 13:26
    `df <- data.frame(sentences) ; df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))`? – David Arenburg Sep 12 '16 at 13:28
  • @David Awesome!!! Exactly what I wanted. Now I will try to do the computation in parallel for both the positive and negative word lists. – Venu Sep 12 '16 at 13:48
  • @DavidArenburg I was trying to use parallel computation. Below is the code I used: `## Number of workers (R processes) to use: cores <- detectCores() ## Set up the 'cluster' cl <- makeCluster(cores-1) df <- data.frame(sentence_df$comment,stringsAsFactors = FALSE) df$posWords <- parSapply(cl=cl,sentence_df$comment, function(x) toString(pos.words[stri_detect_fixed(x, pos.words)]))` but I get an error **could not find function "stri_detect_fixed"** – Venu Sep 13 '16 at 08:18
    See [this](http://stackoverflow.com/questions/23096869/calling-functions-from-non-base-r-packages-in-parallel-package-without-library). Though I'm not sure if parallelizing this will improve performance – David Arenburg Sep 13 '16 at 08:22
  • @David You are right, parallelizing it did not improve the performance much. – Venu Sep 13 '16 at 12:32
  • @DavidArenburg I face one issue while using the function stri_detect_fixed. It does not look for an exact match of the word; instead, if the word is present as part of another word it gets picked up. For example: `words <- c('sin','vice') sentences <- ('Since the app crashes frequently, I advice you guys to fix the issue ASAP')` Now if I use stri_detect_fixed it matches the word 'sin' with 'since', which is not what I want. Could you help me out with this? – Venu Sep 21 '16 at 10:37
  • Maybe edit your question to reflect that. – David Arenburg Sep 21 '16 at 10:40
  • @David Edited my question as you pointed out. – Venu Sep 21 '16 at 10:59
  • @DavidArenburg I tried using stri_detect_regex but the script execution takes a lot of time. I used the following code: ` df <- data.frame(sentence_df$comment,stringsAsFactors = FALSE) posW <- paste("\\b",pos.words,"\\b",sep="") df$posWords <- gsub('\\b', '', sapply(sentence_df$comment, function(x) toString(posW[stri_detect_regex(x, posW)])))` Can you tell me what I am missing here? – Venu Sep 22 '16 at 13:49
  • Running regex expressions by row is very slow, this is exactly why I've used `stri_detect_fixed`. I don't have time for StackOverflow recently, sorry. – David Arenburg Sep 22 '16 at 13:57

2 Answers


I was able to use @David Arenburg's answer with some modifications. Here is what I did. I used the following (suggested by David) to form the data frame.

library(stringi)
df <- data.frame(sentences)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))

The problem with the above approach is that it does not do exact word matching. So I used the following to expand the data frame to one row per matched word, so that the words that do not exactly match a word in the sentence can be filtered out.

df <- data.frame(fil=unlist(s),text=rep(df$sentence, sapply(s, FUN=length)))
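The object `s` is never defined in the answer, so here is a hedged, self-contained sketch of the pipeline as I read it. The assumption (mine, not stated above) is that `s` holds the list of matched words per sentence, obtained by splitting the comma-separated `df$words` column:

```r
library(stringi)

words <- c("far better", "good", "great", "sombre", "happy", "sin", "vice", "crashes")
sentences <- c("This is a great app",
               "Since the app crashes frequently, I advice you guys to fix the issue ASAP")

# step 1: fast substring detection (can include partial hits such as 'sin' inside 'Since')
df <- data.frame(sentence = sentences, stringsAsFactors = FALSE)
df$words <- sapply(sentences, function(x) toString(words[stri_detect_fixed(x, words)]))

# assumption: 's' is the comma-separated matches split into a list, one element per sentence
s <- strsplit(df$words, ", ", fixed = TRUE)

# step 2: expand to one row per (matched word, sentence) pair, repeating
# each sentence once per word it matched
df_long <- data.frame(fil  = unlist(s),
                      text = rep(df$sentence, sapply(s, length)),
                      stringsAsFactors = FALSE)
```

The exact-match filter from the answer can then be applied to `df_long` to drop rows like 'sin'/'Since'.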

After applying the above line the output data frame changes as follows.

sentences                                                      words
This document is far better                                    better
This is a great app                                            great
The night skies were sombre and starless                       sombre 
The app is too good and i am happy using it                    good
The app is too good and i am happy using it                    happy
This is how it works                                            -
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 crashes
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 vice
Since the app crashes frequently, I advice you guys to fix     
the issue ASAP                                                 sin

Now apply the following filter to the data frame to remove those words that are not an exact match to those words present in the sentence.

df <- df[apply(df, 1, function(x) tolower(x[1]) %in% tolower(unlist(strsplit(x[2], split='\\s+')))),]

Now my resulting data frame will be as follows.

    sentences                                                      words
    This document is far better                                    better
    This is a great app                                            great
    The night skies were sombre and starless                       sombre 
    The app is too good and i am happy using it                    good
    The app is too good and i am happy using it                    happy
    This is how it works                                            -
    Since the app crashes frequently, I advice you guys to fix     
    the issue ASAP                                                 crashes

stri_detect_fixed reduced my computation time a lot, and the remaining steps did not take much time. Thanks to @David for pointing me in the right direction.
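As a side note (my own sketch, not from the thread): for exact single-word matches, the detect-then-filter step can also be replaced by tokenizing each sentence once and intersecting the tokens with the word list. Multi-word entries such as "far better" would still need the substring pass.

```r
sentences <- c("This is a great app",
               "Since the app crashes frequently, I advice you guys to fix the issue ASAP")
words <- c("sin", "vice", "crashes", "great")

# split each sentence into lowercase word tokens (apostrophes kept)
tokens <- strsplit(tolower(sentences), "[^a-z']+")

# exact matches only: 'sin' no longer fires on 'Since'
matched <- sapply(tokens, function(tk) toString(intersect(words, tk)))
```

This avoids running a regex per (sentence, word) pair, at the cost of handling multi-word phrases separately.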

Venu

You can do this in the latest version of sentimentr with `extract_sentiment_terms`, but you'll have to make a sentiment key first and assign values to the words:

pos <- c("far better","good","great","sombre","happy")
neg <- c('sin','vice','crashes') 

sentences <- c('Since the app crashes frequently, I advice you guys to fix the issue ASAP',
    "This document is far better", "This is a great app","The night skies were sombre and starless", 
    "The app is too good and i am happy using it", "This is how it works")

library(sentimentr)
(sentkey <- as_key(data.frame(c(pos, neg), c(rep(1, length(pos)), rep(-1, length(neg))), stringsAsFactors = FALSE)))

##             x  y
## 1:    crashes -1
## 2: far better  1
## 3:       good  1
## 4:      great  1
## 5:      happy  1
## 6:        sin -1
## 7:     sombre  1
## 8:       vice -1

extract_sentiment_terms(sentences, sentkey)

##    element_id sentence_id negative   positive
## 1:          1           1  crashes           
## 2:          2           1          far better
## 3:          3           1               great
## 4:          4           1              sombre
## 5:          5           1          good,happy
## 6:          6           1                    
Tyler Rinker