string matching in R

Question

I have four words: wordA, wordB, wordX and wordY. I have a data set consisting of one column (message), and the data type of the message column is factor. For each row, I want to count the total number of occurrences of wordX and wordY, subtract it from the number of occurrences of wordA and wordB, and put the result in a new column in that row.

For example, if the text of the message column is "wordD wordA wordX wordA wordC wordA wordB wordY", then the value should be wordA-wordX+wordA+wordA+wordB-wordY = 1-1+1+1+1-1 = +2.

I wrote this code, but it doesn't count duplicated words. I would appreciate it if you could help me.

for (i in 1:nrow(dataset)) {
  counter = 0

  if (length(grep("wordA", dataset[i, 1])) == 1) {
    counter = counter + 1
  }
  if (length(grep("wordB", dataset[i, 1])) == 1) {
    counter = counter + 1
  }
  if (length(grep("wordX", dataset[i, 1])) == 1) {
    counter = counter - 1
  }
  if (length(grep("wordY", dataset[i, 1])) == 1) {
    counter = counter - 1
  }
  dataset[i, 2] = counter
}

Please check this [link](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example). A good reproducible example will help others tackle your question a lot more easily. — CHP, Nov 07 '13 at 04:31
There seems to be a confusion of "rows" and "columns" in the problem description, and vagueness about what is being matched. Produce code that creates set_A and set_B, and explain why that "message column" should have a value of +2. — IRTFM, Nov 07 '13 at 04:35

score 2 · Answer 1 · answered Nov 08 '13 at 00:08

You could also use gregexpr, which finds every occurrence of a given pattern and returns the starting position of each match. It returns -1 for an element with no match, so count the positive positions rather than taking the length of the result.

messages <- c("wordD wordA wordX wordA wordC wordA wordB wordY",
              "wordX wordA wordY wordY wordC wordD wordB wordY",
              "wordB wordA wordX wordA wordB wordA wordB wordY")
score <- sapply(gregexpr("wordA|wordB", messages), function(m) sum(m > 0)) -
            sapply(gregexpr("wordX|wordY", messages), function(m) sum(m > 0))
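Since the question stores the messages in a factor column of a data frame, the same idea can be applied to that column and written back as a new one. This is only a minimal sketch: it assumes a data frame named `dataset` with a `message` column as described in the question, and the helper `count_matches` is a hypothetical name introduced here for illustration.

```r
# Hypothetical helper: number of matches of `pattern` in each element of `x`
count_matches <- function(pattern, x) {
  hits <- gregexpr(pattern, as.character(x))  # as.character() handles factor input
  # gregexpr() reports -1 when there is no match, so count only positive positions
  vapply(hits, function(m) sum(m > 0), integer(1))
}

# Assumed layout from the question: a single factor column called `message`
dataset <- data.frame(
  message = factor(c("wordD wordA wordX wordA wordC wordA wordB wordY",
                     "wordC wordD"))
)

# New column: occurrences of wordA/wordB minus occurrences of wordX/wordY
dataset$score <- count_matches("wordA|wordB", dataset$message) -
  count_matches("wordX|wordY", dataset$message)
```

The second row contains none of the four words, which is the case where counting positive positions matters.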

Jota · Accepted Answer · 2013-11-07T05:35:27.887

I'm not entirely sure if this is what you're looking for, but here is what I thought you might be asking. You want to score each element of a vector of sentences or phrases (e.g. mess <- c("some stuff here", "some stuff not here", "most stuff here")) according to which words are present. The presence of some words adds 1 to the score, and the presence of other words subtracts 1 from it. In my example, the words that add 1 are "here" and "stuff", and the words that subtract 1 are "some" and "most".

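# vector of messages to score
mess <- c("some stuff here", "some stuff not here", "most stuff here")

# count the tokens in each message that match a positive word
positiveword <- lapply(strsplit(mess, " "), function(x) grepl("here|stuff", x))
positiveword <- lapply(positiveword, sum)

# count the tokens in each message that match a negative word
negativeword <- lapply(strsplit(mess, " "), function(x) grepl("some|most", x))
negativeword <- lapply(negativeword, sum)

# score = positive matches minus negative matches
score <- unlist(positiveword) - unlist(negativeword)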
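One caveat with `grepl` on the split tokens is that it matches substrings, so a token such as "somewhere" would count toward both "some" and "here". Below is a sketch of an exact-match variant, assuming whole-token matches are what is wanted. The names `positive`, `negative`, and `score_message` are illustrative and not part of the answer above.

```r
positive <- c("here", "stuff")
negative <- c("some", "most")

# Hypothetical helper: +1 for each token found in `positive`,
# -1 for each token found in `negative`
score_message <- function(msg) {
  tokens <- strsplit(msg, " ", fixed = TRUE)[[1]]
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

mess <- c("some stuff here", "some stuff not here", "most stuff here", "somewhere")
score <- vapply(mess, score_message, integer(1), USE.NAMES = FALSE)
```

Because `%in%` compares whole tokens, repeated words are still counted once per occurrence, while "somewhere" contributes nothing to the score.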
