I'm fairly new to text analytics in R and I am trying to use stemCompletion.
Here's what I did at first:
#Clean Corpus
# 1. Stripping any extra white space:
corpus <- tm_map(corpus, stripWhitespace)
# 2. Transforming everything to lowercase
corpus <- tm_map(corpus, content_transformer(tolower))
# 3. Removing numbers
corpus <- tm_map(corpus, removeNumbers)
# 4. Removing punctuation
corpus <- tm_map(corpus, removePunctuation, preserve_intra_word_contractions=FALSE)
# 5. Removing stop words
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# 6. Stem words
corpusStem <- tm_map(corpus, stemDocument, language="english")
I then ran this line for stemCompletion, and it didn't actually change anything:
corpusStem <- tm_map(corpusStem, stemCompletion, dictionary=corpus, type="shortest")
I read up on stemCompletion and learned that it needs to be applied to each individual word. I found this code on another thread (SOF?48022087):
stemCompletion_mod <- function(x,dict=dictCorpus) {
PlainTextDocument(stripWhitespace(paste(stemCompletion(unlist(strsplit(as.character(x)," ")),dictionary=dict, type="shortest"),sep="", collapse=" ")))
}
I edited the above to use my corpus names, but when I ran stemCompletion_mod, I got an error:
stemCompletion_mod(corpusStem,corpus)
Error in grep(sprintf("^%s", w), dictionary, value = TRUE) : invalid regular expression, reason 'Missing ')''
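For reference, the same kind of message can apparently be reproduced in base R alone. This is only a minimal sketch, assuming the pattern is built with sprintf("^%s", w) as the error text shows; the values of w and dictionary below are made up for illustration and do not come from my corpus.

```r
# Hypothetical values, chosen only to illustrate the failure
w <- "list(content"                      # a "word" containing an unbalanced "("
dictionary <- c("business", "charge")    # stand-in for the completion dictionary

# Same call shape as in the error message above; the error is caught
# so that its message can be inspected instead of stopping the script
msg <- tryCatch(
  suppressWarnings(grep(sprintf("^%s", w), dictionary, value = TRUE)),
  error = function(e) conditionMessage(e)
)
print(msg)
```

If that is what is happening, at least one token reaching grep would have to contain an unbalanced "(", even though punctuation was removed from the corpus text in step 4.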
What is causing this error? (I also posted on the original thread where I found that code, but it's quite old, so I'm checking whether anyone else has some insight here!)
Thanks so much!
Here is the data from the CSV that threw the error (dput output):
structure(list(Type = c("Example 1", "Example 2"), Comment = c("This is an example for a corpus. Words like business and charge are not stemming correctly.",
"Here is another example. Challenge and always also need to have stemCompletion."
)), class = "data.frame", row.names = c(NA, -2L))
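To reproduce the problem, a corpus can be built from this data frame roughly as follows. This is a sketch under assumptions: the name df and the use of VCorpus(VectorSource(...)) on the Comment column are illustrative choices, not necessarily how the corpus in the steps above was created.

```r
library(tm)

# Assumption: the dput() output above is assigned to a data frame named `df`
df <- structure(list(Type = c("Example 1", "Example 2"), Comment = c("This is an example for a corpus. Words like business and charge are not stemming correctly.",
"Here is another example. Challenge and always also need to have stemCompletion."
)), class = "data.frame", row.names = c(NA, -2L))

# Assumption: one document per row, taken from the Comment column
corpus <- VCorpus(VectorSource(df$Comment))

# Quick check that both comments were loaded as separate documents
print(length(corpus))
```

The cleaning steps at the top of the post can then be run on this corpus unchanged.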