Avoiding specific words in word stemming with tm package

Question

A previous post addressed this issue here: Text-mining with the tm-package - word stemming

However, I am still running into challenges with the tm package.

My goal is to stem a large corpus, but I want to exclude specific words from stemming.

For instance, I want words such as "indians", "indianspeak", and "indianss" reduced to the root form "indian". However, stemming also transforms words such as "Indianapolis" and "Indiana" to "indian", which I do not want.

The post mentioned above addresses this by substituting unique identifiers for the specific words in the corpus, stemming the corpus, and then replacing the identifiers with the original words. The approach makes sense, but I still run into problems with the metadata when the stemming transformation is applied to the corpus. From my research, tm v0.6 changed things so that transformations can no longer operate on plain character values (see: R-Project no applicable method for 'meta' applied to an object of class "character").

However, the solutions posted are not solving the errors I am encountering.
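For reference, the placeholder idea from the linked post can be sketched in base R as below. This is only an illustration under stated assumptions: `mgsub_fixed` and `toy_stem` are hypothetical helpers written for this example (the toy function is not a real Porter stemmer and is not part of tm), and the placeholder keys are assumed not to occur anywhere in the text.

```r
# Hypothetical helper: apply several fixed-string substitutions in sequence.
mgsub_fixed <- function(x, pattern, replacement) {
  stopifnot(length(pattern) == length(replacement))
  for (i in seq_along(pattern)) {
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

# Toy stand-in for a real stemmer (NOT Porter stemming): collapses any
# lowercase word starting with "indian" down to "indian".
toy_stem <- function(x) gsub("\\bindian[a-z]*", "indian", x, perl = TRUE)

# Words to protect. The longer word comes first, because "indiana" is a
# prefix of "indianapolis" and the substitutions run in order.
retain  <- c("indianapolis", "indiana")
# Placeholder keys (assumed absent from the text and untouched by the stemmer).
replace <- c("zzprotect1zz", "zzprotect2zz")

docs <- c("indians visit indianapolis", "indiana and indianss")

protected <- mgsub_fixed(docs, retain, replace)     # hide protected words
stemmed   <- toy_stem(protected)                    # stem everything else
restored  <- mgsub_fixed(stemmed, replace, retain)  # put protected words back
print(restored)
```

With a real stemmer, the keys would need to be chosen so that the stemmer leaves them unchanged, otherwise the final substitution cannot find them.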

Starting from the solution in the first link posted, I am still running into errors from step 5:

# Step 5: reverse -> sub the identifier keys with the words you want to retain

corpus.temp[seq_len(length(corpus.temp))] <- lapply(corpus.temp, mgsub, pattern=replace, replacement=retain)

Error in UseMethod("meta", x) : no applicable method for 'meta' applied to an object of class "character"

Before moving on to my larger, more complex corpus, I would like to understand why this happens and whether there is a solution.
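One workaround often suggested for this error in tm 0.6 and later is to run the substitution through `tm_map()` with `content_transformer()`, so that each element of the corpus stays a document object instead of becoming a bare character vector. The sketch below assumes that approach applies here; it is not a confirmed fix for the corpus above. The example data and the `restore_words` helper are hypothetical, while `retain` and `replace` mirror the names used in step 5.

```r
library(tm)

# Hypothetical example data standing in for the already-stemmed corpus.
retain  <- c("indianapolis", "indiana")
replace <- c("zzprotect1zz", "zzprotect2zz")
corpus.temp <- VCorpus(VectorSource(c("indian visit zzprotect1zz",
                                      "zzprotect2zz and indian")))

# Hypothetical helper: same idea as mgsub, but with the text as the first
# argument, which is the value content_transformer() passes in.
restore_words <- function(x, pattern, replacement) {
  for (i in seq_along(pattern)) {
    x <- gsub(pattern[i], replacement[i], x, fixed = TRUE)
  }
  x
}

# content_transformer() applies the function to each document's content and
# returns the document itself, so its metadata is preserved.
corpus.temp <- tm_map(corpus.temp, content_transformer(restore_words),
                      pattern = replace, replacement = retain)

print(class(corpus.temp[[1]]))        # document class after the transformation
print(sapply(corpus.temp, content))   # restored text of each document
```

If the elements are still `PlainTextDocument` objects after this step, later calls that rely on `meta()` should have something to dispatch on.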
