Given a corpus of text, I want to use the tm (Text Mining) package in R for word stemming and stem completion to normalize the terms. However, the stemCompletion step has issues in the 0.6.x versions of the package. I am using R 3.3.1 with tm 0.6-2.
This question has been asked before, but I have not seen a complete answer that actually works. Here is complete code that demonstrates the issue.
require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
"e.g., stemming, stopword removal, et cetera.",
"In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus
# Remove common word endings (e.g., "ing", "es")
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")
# Next, we remove all the empty spaces generated by isolating the
# word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
Here is the output:
<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity : 47%
Maximal term length: 9
Weighting : term frequency (tf)
[1] "all" "cetera" "concept" "corpus" "document"
[6] "function" "have" "into" "modifi" "onc"
[11] "remov" "stem" "stopword" "subsum" "the"
[16] "this" "transform" "typic" "want"
Several of the terms have been stemmed: "modifi", "remov", "subsum", "typic", and "onc".
Next, I want to complete the stems.
myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)
At this stage, the documents in the corpus are no longer TextDocument objects, and creating a TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The documented workaround is to apply the PlainTextDocument() function next.
myCorpus <- tm_map(myCorpus, PlainTextDocument)
tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)
Here is the output:
<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity : 0%
Maximal term length: 7
Weighting : term frequency (tf)
[1] "content" "meta"
Calling PlainTextDocument has corrupted the corpus: the only terms left are "content" and "meta".
I expect the stemmed words to be completed, e.g. "modifi" => "modifier", "onc" => "once", etc.
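One direction that might avoid the problem, offered as a minimal sketch rather than a confirmed fix: stemCompletion() takes a character vector of stems and returns a character vector, so it could be applied word by word inside content_transformer() instead of being passed to tm_map() directly. The names complete_stems and dict are hypothetical helpers, not part of tm. Two further choices are assumptions of this sketch: VCorpus() is requested explicitly so that each document is processed on its own, and a stem with no completion is left unchanged.

```r
library(tm)  # stemDocument() also needs the SnowballC package installed

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")

# Assumption: VCorpus is used explicitly so tm_map() handles one document at a time.
corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)

# Hypothetical dictionary: the unique unstemmed words, as a plain character vector.
dict <- unique(unlist(strsplit(unlist(lapply(corpus, as.character)), "\\s+")))
dict <- dict[nzchar(dict)]

corpus <- tm_map(corpus, stemDocument, language = "english")

# Hypothetical helper: complete each stem of one document and rebuild the text.
complete_stems <- function(text, dictionary) {
  stems <- unlist(strsplit(text, "\\s+"))
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary)
  # Assumption of this sketch: keep the stem when no completion is found.
  unmatched <- is.na(completed) | !nzchar(completed)
  completed[unmatched] <- stems[unmatched]
  paste(completed, collapse = " ")
}

# content_transformer() keeps each document a TextDocument, so no
# PlainTextDocument() call is needed afterwards.
corpus <- tm_map(corpus, content_transformer(complete_stems), dictionary = dict)

tdm <- TermDocumentMatrix(corpus, control = list(wordLengths = c(3, Inf)))
print(Terms(tdm))
```

Because stemCompletion() matches dictionary words that start with the stem, a stem such as "modifi" has no match here (the dictionary holds "modify") and would stay as it is under this sketch.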