Calling stemCompletion and PlainTextDocument corrupts text in R

Question

Given a corpus of text, I want to use the tm (Text Mining) package in R for word stemming and stem completion to normalize the terms. However, the stemCompletion step has issues in the 0.6.x versions of the package. I am using R 3.3.1 with tm 0.6-2.

This question has been asked before, but I have not seen a complete answer that actually works. Here is complete code that demonstrates the issue.

 require(tm)
 txt <- c("Once we have a corpus we typically want to modify the documents in it",
          "e.g., stemming, stopword removal, et cetera.",
          "In tm, all this functionality is subsumed into the concept of a transformation.")

 myCorpus <- Corpus(VectorSource(txt))

 myCorpus <- tm_map(myCorpus, content_transformer(tolower))
 myCorpus <- tm_map(myCorpus, removePunctuation)
 myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
 myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
 myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

 tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
 print(tdm)
 print(dimnames(tdm)$Terms)

Here is the output:

<<TermDocumentMatrix (terms: 19, documents: 2)>>
Non-/sparse entries: 20/18
Sparsity           : 47%
Maximal term length: 9
Weighting          : term frequency (tf)
 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"

Several of the terms have been stemmed: "modifi", "remov", "subsum", "typic", and "onc".

Next, I want to complete the stems.

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

At this stage, the documents in the corpus are no longer TextDocument objects, and creating a TermDocumentMatrix fails with the error: inherits(doc, "TextDocument") is not TRUE. The commonly suggested fix is to apply the PlainTextDocument() function next.

myCorpus <- tm_map(myCorpus, PlainTextDocument)

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

Here is the output:

<<TermDocumentMatrix (terms: 2, documents: 2)>>
Non-/sparse entries: 4/0
Sparsity           : 0%
Maximal term length: 7
Weighting          : term frequency (tf)
[1] "content" "meta"

Calling PlainTextDocument has corrupted the corpus.

I expect the stemmed words to be completed, e.g. "modifi" => "modify", "onc" => "once", etc.

Possible duplicate of [R Warning in stemCompletion and error in TermDocumentMatrix](http://stackoverflow.com/questions/30321770/r-warning-in-stemcompletion-and-error-in-termdocumentmatrix) — Hack-R, Jul 26 '16 at 18:37
As mentioned in the question, this question has been repeated but haven't seen a complete answer and more often the question was not fully self-contained (e.g. loaded a text file). — CodeMonkey, Jul 26 '16 at 18:58
http://stackoverflow.com/questions/25206049/stemcompletion-is-not-working — Hack-R, Jul 26 '16 at 19:08
It was because of the earlier error where you got the 3 warning messages I guess. It works for me with your example when I do it as in the updated answer below. Hope that helps, cheers. — Hack-R, Jul 26 '16 at 19:13
Please see my edit comment regarding the rollback, but I think you should be good to go anyway — Hack-R, Jul 26 '16 at 19:26
Saw comment re: rollback so dropped the comment for the warning but want the tdm as input to other packages so updated question accordingly. Thanks for the effort so far but answer still doesn't quite work. — CodeMonkey, Jul 26 '16 at 19:48

Hack-R · Answer 1 · 2016-07-26T19:09:48.777

Calling PlainTextDocument didn't corrupt the corpus.

You may have noticed that when you ran the line

myCorpus = tm_map(myCorpus, stemCompletion, dictionary=myCorpusCopy)

you got several warning messages:

Warning messages:
1: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
2: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used
3: In grep(sprintf("^%s", w), dictionary, value = TRUE) :
  argument 'pattern' has length > 1 and only the first element will be used

Those were worth mentioning ;)
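The warnings hint at the cause: stemCompletion expects a character vector of individual stems, whereas tm_map hands it a whole document object. A minimal sketch of one possible workaround follows, assuming a plain character vector is used as the dictionary and that unmatched stems should simply be kept. `complete_doc`, `dict`, and the sample text are hypothetical names and data for illustration, not part of tm.

```r
library(tm)

# Assumed dictionary of unstemmed words (a plain character vector).
dict <- c("once", "typically", "modify", "documents", "corpus", "want")

# Hypothetical helper: complete each word of a single document string.
complete_doc <- function(x, dictionary) {
  words <- unlist(strsplit(x, " ", fixed = TRUE))
  words <- words[nzchar(words)]
  completed <- stemCompletion(words, dictionary = dictionary)
  # Keep the stem when no completion was found (NA or "").
  unmatched <- is.na(completed) | completed == ""
  completed[unmatched] <- words[unmatched]
  paste(completed, collapse = " ")
}

# Wrapping the helper in content_transformer() keeps each element a text
# document, so no PlainTextDocument() repair step is needed afterwards.
stemmed <- VCorpus(VectorSource("onc we typic want to modifi the document"))
completed_corpus <- tm_map(stemmed, content_transformer(complete_doc),
                           dictionary = dict)
print(content(completed_corpus[[1]]))
```

Because the helper works on plain strings, it can be checked on its own before it is mapped over a corpus.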

This is how to carry out stemming with stem completion using your data:

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")
myCorpus <- Corpus(VectorSource(txt))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
tdm      <- TermDocumentMatrix(myCorpus, control = list(stemming = TRUE)) 
cbind(stems = rownames(tdm), completed = stemCompletion(rownames(tdm), myCorpus))

          stems       completed       
all       "all"       "all"           
cetera    "cetera"    "cetera"        
concept   "concept"   "concept"       
corpus    "corpus"    "corpus"        
document  "document"  "documents"     
function  "function"  "functionality" 
have      "have"      "have"          
into      "into"      "into"          
modifi    "modifi"    "modify"              
onc       "onc"       "once"          
remov     "remov"     "removal"       
stem      "stem"      "stemming"      
stopword  "stopword"  "stopword"      
subsum    "subsum"    "subsumed"      
the       "the"       "the"           
this      "this"      "this"          
transform "transform" "transformation"
typic     "typic"     "typically"     
want      "want"      "want"
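If the completed terms are needed as labels, for example in a word cloud, one option is to aggregate the term frequencies by completed term instead of modifying the corpus. This is only a sketch under assumptions: it assumes the SnowballC package is installed (needed for `stemming = TRUE`), and it keeps the stem wherever no completion is found. The names `labels` and `freq` are illustrative.

```r
library(tm)

txt <- c("Once we have a corpus we typically want to modify the documents in it",
         "e.g., stemming, stopword removal, et cetera.",
         "In tm, all this functionality is subsumed into the concept of a transformation.")
corpus <- VCorpus(VectorSource(txt))
corpus <- tm_map(corpus, content_transformer(tolower))
corpus <- tm_map(corpus, removePunctuation)
tdm    <- TermDocumentMatrix(corpus, control = list(stemming = TRUE))

stems     <- rownames(tdm)
completed <- as.character(stemCompletion(stems, corpus))

# Keep the stem where no completion was found (NA or "").
unmatched <- is.na(completed) | completed == ""
labels    <- ifelse(unmatched, stems, completed)

# Total frequency per label; stems that share a completion are summed.
freq <- tapply(rowSums(as.matrix(tdm)), labels, sum)
print(freq)
```

The names of `freq` are the completed terms, so the vector can be passed on as words and frequencies to a plotting package.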

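To collect the completed terms in a single PlainTextDocument (note that this overwrites tdm with a document rather than updating the term-document matrix):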

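# Complete each stem against the dictionary and return one PlainTextDocument.
stemCompletion_mod <- function(x, dict) {
  stems     <- unlist(strsplit(as.character(x), " "))
  completed <- stemCompletion(stems, dictionary = dict, type = "shortest")
  PlainTextDocument(stripWhitespace(paste(completed, collapse = " ")))
}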

tdm <- stemCompletion_mod(rownames(tdm), myCorpus)  


tdm$content

[1] "all cetera concept corpus documents functionality have into NA once removal stemming stopword subsumed the this transformation typically want"

Useful for getting a list of stem-completed words, but the TermDocumentMatrix still has the uncompleted stems at this point, and using tdm in wordcloud or another package still shows the uncompleted stems. — CodeMonkey, Jul 26 '16 at 18:49
@JasonM1 OK I updated it to write the changes back to the TDM. I got the function from here: http://stackoverflow.com/questions/25206049/stemcompletion-is-not-working — Hack-R, Jul 26 '16 at 19:11
Thanks @Hack-R for the answer, but if I use wordcloud with tdm after the steps above, wordcloud still shows the uncompleted stems. Also, when I run the code above there is an empty string where "modify" is listed for "modifi". Using R 3.3.1 with tm 0.6-2 on Windows. — CodeMonkey, Jul 26 '16 at 20:08

score 1 · Answer 2 · answered Jan 07 '17 at 02:36

With respect to Hack-R's solution, I had the same issue as Jason, where I wanted to have the "StemCompleted" words for use in a word cloud, and as part of the TDM.

Since stemCompletion doesn't return a TDM, I extracted the "terms" from the TDM, then ran stemCompletion on that.

(I broke these out into a separate variable while I was testing.)

require(tm)
txt <- c("Once we have a corpus we typically want to modify the documents in it",
      "e.g., stemming, stopword removal, et cetera.",
      "In tm, all this functionality is subsumed into the concept of a transformation.")

myCorpus <- Corpus(VectorSource(txt))

myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation)
myCorpusCopy <- myCorpus

 # *Removing common word endings* (e.g., "ing", "es") 
myCorpus <- tm_map(myCorpus, stemDocument, language = "english")

 # Next, we remove all the empty spaces generated by isolating the
 # word stems in the previous step.
myCorpus <- tm_map(myCorpus, content_transformer(stripWhitespace))

tdm <- TermDocumentMatrix(myCorpus, control = list(wordLengths = c(3, Inf)))
print(tdm)
print(dimnames(tdm)$Terms)

Giving this output:

 [1] "all"       "cetera"    "concept"   "corpus"    "document" 
 [6] "function"  "have"      "into"      "modifi"    "onc"      
[11] "remov"     "stem"      "stopword"  "subsum"    "the"      
[16] "this"      "transform" "typic"     "want"

Since stemCompletion returns a named character vector, I just replaced the terms portion of 'tdm' with a stem-completed version:

tdm$dimnames$Terms <- as.character(stemCompletion(tdm$dimnames$Terms, myCorpusCopy, type = "prevalent"))
print(tdm$dimnames$Terms)

This gives me:

 [1] "all"            "cetera"         "concept"        "corpus"        
 [5] "documents"      "functionality"  "have"           "into"          
 [9] ""               "once"           "removal"        "stemming"      
[13] "stopword"       "subsumed"       "the"            "this"          
[17] "transformation" "typically"      "want"

Apparently you get blank entries for stems it cannot complete (here "modifi"), but at least this time you can work with the stem-completed terms.
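A minimal base R sketch of a prefix lookup like the grep(sprintf("^%s", w), dictionary) call shown in the warning messages earlier. It illustrates why a stem such as "modifi" can come back empty ("modify" does not start with "modifi") and how to fall back to the stem itself. `first_match` is a hypothetical helper written for this illustration, not part of tm, and the small dictionary is made up from words in the example text.

```r
# Dictionary words as they appear in the original (unstemmed) text.
dictionary <- c("modify", "once", "removal", "documents")
stems      <- c("modifi", "onc", "remov", "document")

# Hypothetical helper: first dictionary word that starts with the stem,
# or NA if no dictionary word has the stem as a prefix.
first_match <- function(stem, dictionary) {
  hits <- grep(sprintf("^%s", stem), dictionary, value = TRUE)
  if (length(hits)) hits[1] else NA_character_
}

completed <- vapply(stems, first_match, character(1), dictionary = dictionary)

# Fall back to the stem itself where nothing matched.
unmatched <- is.na(completed)
completed[unmatched] <- stems[unmatched]

print(unname(completed))
# [1] "modifi"    "once"      "removal"   "documents"
```

The same fallback can be applied to the output of stemCompletion before assigning it to the terms of a TermDocumentMatrix, so that no term ends up blank.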
