Objective
I would like to count the number of times the word "love" appears in a document, but only if it isn't preceded by the word 'not', e.g. "I love films" would count as one appearance whilst "I do not love films" would not count as an appearance.
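Just to pin down the rule, here is how I would express it with a plain regular expression in base R (a negative lookbehind). This is only for illustration; I would like to achieve the same thing within the tm workflow below.
count.love <- function(x) {
  # count "love" only when the immediately preceding word is not "not"
  m <- gregexpr("(?<!\\bnot )\\blove\\b", tolower(x), perl = TRUE)
  sum(sapply(m, function(i) sum(i > 0)))
}
count.love("I love films")        # 1
count.love("I do not love films") # 0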
Question
How would one proceed using the tm package?
R Code
Below is some self-contained code which I would like to modify to do the above.
require(tm)
# text vector
my.docs <- c(" I love the Red Hot Chilli Peppers! They are the most lovely people in the world.",
"I do not love the Red Hot Chilli Peppers but I do not hate them either. I think they are OK.\n",
"I hate the `Red Hot Chilli Peppers`!")
# convert to data.frame
my.docs.df <- data.frame(docs = my.docs, row.names = c("positiveText", "neutralText", "negativeText"), stringsAsFactors = FALSE)
# convert to a corpus
my.corpus <- Corpus(DataframeSource(my.docs.df))
# Some standard preprocessing
my.corpus <- tm_map(my.corpus, stripWhitespace)
my.corpus <- tm_map(my.corpus, tolower)
my.corpus <- tm_map(my.corpus, removePunctuation)
my.corpus <- tm_map(my.corpus, removeWords, stopwords("english"))
my.corpus <- tm_map(my.corpus, stemDocument)
my.corpus <- tm_map(my.corpus, removeNumbers)
# construct dictionary
my.dictionary.terms <- tolower(c("love", "Hate"))
my.dictionary <- Dictionary(my.dictionary.terms)
# construct the term document matrix
my.tdm <- TermDocumentMatrix(my.corpus, control = list(dictionary = my.dictionary))
inspect(my.tdm)
# Terms positiveText neutralText negativeText
# hate 0 1 1
# love 2 1 0
Further information
I am trying to reproduce the dictionary rules functionality from the commercial package WordStat, which is able to make use of dictionary rules, i.e.
"hierarchical content analysis dictionaries or taxonomies composed of words, word patterns, phrases as well as proximity rules (such as NEAR, AFTER, BEFORE) for achieving precise measurement of concepts"
Also I noticed this interesting SO question: Open-source rule-based pattern matching / information extraction frameworks?
UPDATE 1: Based on @Ben's comment and post I got this (although it is slightly different at the end, it is strongly inspired by his answer, so full credit to him):
require(data.table)
require(RWeka)
# bi-gram tokeniser function
BigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 2))
# get all 1-gram and 2-gram word counts
tdm <- TermDocumentMatrix(my.corpus, control = list(tokenize = BigramTokenizer))
# convert to data.table
dt <- as.data.table(as.data.frame(as.matrix(tdm)), keep.rownames=TRUE)
setkey(dt, rn)
# attempt at extracting but includes overlaps i.e. words counted twice
dt[like(rn, "love")]
# rn positiveText neutralText negativeText
# 1: i love 1 0 0
# 2: love 2 1 0
# 3: love peopl 1 0 0
# 4: love the 1 1 0
# 5: most love 1 0 0
# 6: not love 0 1 0
Then I guess I would need to do some row subsetting and row subtraction, which would lead to something like
dt1 <- dt["love"]
# rn positiveText neutralText negativeText
#1: love 2 1 0
dt2 <- dt[like(rn, "love") & like(rn, "not")]
# rn positiveText neutralText negativeText
#1: not love 0 1 0
# somehow do something like
# DT = dt1 - dt2
# but I can't work out how to code that; the required output would be
# rn positiveText neutralText negativeText
#1: love 2 0 0
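Perhaps something along these lines would do that subtraction (just a rough sketch: the column names are taken from the output above, and pmax is only there to keep the counts non-negative), though I'm not sure it is idiomatic data.table:
count.cols <- c("positiveText", "neutralText", "negativeText")
# counts of the negated bigrams, summed over the (possibly several) rows of dt2
negated <- colSums(dt2[, .SD, .SDcols = count.cols])
# the "DT = dt1 - dt2" step, flooring at zero
DT <- copy(dt1)
DT[, (count.cols) := as.list(pmax(unlist(.SD) - negated, 0)), .SDcols = count.cols]
DT
# rn positiveText neutralText negativeText
#1: love 2 0 0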
I am not sure that is the right way to do it with data.table, but either way this approach would be akin to WordStat's 'NOT NEAR' dictionary function, e.g. in this case only count the word "love" if it doesn't appear within one word either directly before or directly after the word 'not'.
If we were to use an m-gram tokeniser, then it would be like saying we only count the word "love" if it doesn't appear within (m-1) words on either side of the word "not".
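For example, widening the window to m = 3 would presumably just mean swapping the tokeniser for something like this (TrigramTokenizer is just my made-up name for it):
# 1-, 2- and 3-gram tokeniser, i.e. check for "not" within 2 words either side of "love"
TrigramTokenizer <- function(x) NGramTokenizer(x, Weka_control(min = 1, max = 3))
tdm3 <- TermDocumentMatrix(my.corpus, control = list(tokenize = TrigramTokenizer))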
Other approaches are most welcome!