I have some simple code for text analytics. Before creating the DTM, I apply stemCompletion. However, I don't understand the output: I can't tell whether I am using it incorrectly or whether this is simply how it behaves.

I referred to this link for help: text-mining-with-the-tm-package-word-stemming

The issue is that after stem completion, my DTM shrinks and no longer contains the tokens at all (it returns only 'content' and 'meta').

My code and outputs:

texts <- c("i am member of the XYZ association",
           "apply for our open associate position", 
           "xyz memorial lecture takes place on wednesday", 
           "vote for the most popular lecturer")

myCorpus <- Corpus(VectorSource(texts))
myCorpus <- tm_map(myCorpus, content_transformer(tolower))
myCorpus <- tm_map(myCorpus, removePunctuation) 
myCorpus <- tm_map(myCorpus, removeNumbers)
removeURL <- function(x) gsub("http[[:alnum:]]*", "", x)
myCorpus <- tm_map(myCorpus, content_transformer(removeURL))  #??
myCorpusCopy <- myCorpus
myCorpus <- tm_map(myCorpus, stemDocument)

for (i in 1:4) {
  cat(paste("[[", i, "]] ", sep = ""))
  writeLines(as.character(myCorpus[[i]]))
}

Output:
  [[1]] i am member of the xyz associ
  [[2]] appli for our open associ posit
  [[3]] xyz memori lectur take place on wednesday
  [[4]] vote for the most popular lectur


myCorpus <- tm_map(myCorpus, stemCompletion, dictionary = myCorpusCopy)
for (i in 1:4) {
  cat(paste("[[", i, "]] ", sep = ""))
  writeLines(as.character(myCorpus[[i]]))
}

Output:
  [[1]] content
  meta
  [[2]] content
  meta
  [[3]] content
  meta
  [[4]] content
  meta

myCorpus <- tm_map(myCorpus, PlainTextDocument)

dtm <- DocumentTermMatrix(myCorpus, control = list(weighting = weightTf))
dtm
inspect(dtm)

Output:
  > inspect(dtm)
  <<DocumentTermMatrix (documents: 4, terms: 2)>>
    Non-/sparse entries: 8/0
  Sparsity           : 0%
  Maximal term length: 7
  Weighting          : term frequency (tf)

  Terms
  Docs           content meta
  character(0)       1    1
  character(0)       1    1
  character(0)       1    1
  character(0)       1    1

Expected output: stemming should run successfully (both stemDocument and stemCompletion). I am using version 0.6 of the tm package.

Hardik Gupta

1 Answer

You are using the function incorrectly. Here's how it works:

library(tm)  # stemming = TRUE also needs the SnowballC package installed
texts <- c("i am member of the XYZ association",
           "apply for our open associate position", 
           "xyz memorial lecture takes place on wednesday", 
           "vote for the most popular lecturer")
corp <- Corpus(VectorSource(texts))
tdm <- TermDocumentMatrix(corp, control = list(stemming = TRUE))
Terms(tdm)
#  [1] "appli"     "associ"    "for"       "lectur"    "member"    "memori"    "most"      "open"     
#  [9] "our"       "place"     "popular"   "posit"     "take"      "the"       "vote"      "wednesday"
# [17] "xyz" 
stemCompletion(Terms(tdm), corp)
# appli      associ         for      lectur      member      memori        most        open 
#    "" "associate"       "for"   "lecture"    "member"  "memorial"      "most"      "open" 
#   our       place     popular       posit        take         the        vote   wednesday 
# "our"     "place"   "popular"  "position"     "takes"       "the"      "vote" "wednesday" 
#   xyz 
# "xyz"
lukeA
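For the per-document workflow in the question, one possible approach is to wrap `stemCompletion` in a small helper, since it expects a character vector of stems (as in the call above) rather than a whole document object. The following is only a sketch under stated assumptions: `complete_doc` and `dict` are hypothetical names introduced here, `VCorpus` is used instead of `Corpus` so that each text is a separate document object, and stems without a completion are kept unchanged, which is a design choice rather than anything tm prescribes.

```r
library(tm)  # stemDocument also needs the SnowballC package installed

texts <- c("i am member of the XYZ association",
           "apply for our open associate position",
           "xyz memorial lecture takes place on wednesday",
           "vote for the most popular lecturer")

corp <- VCorpus(VectorSource(texts))
corp <- tm_map(corp, content_transformer(tolower))
dict <- corp                          # unstemmed copy, used as the dictionary
corp <- tm_map(corp, stemDocument)

# Hypothetical helper: split a document into stems, complete each stem,
# and fall back to the stem itself when no completion is found.
complete_doc <- function(x, dictionary) {
  stems <- unlist(strsplit(x, "\\s+"))
  stems <- stems[nzchar(stems)]
  completed <- stemCompletion(stems, dictionary = dictionary)
  completed <- ifelse(is.na(completed) | completed == "", stems, completed)
  paste(completed, collapse = " ")
}

# content_transformer keeps each element a text document, so the
# corpus structure survives the mapping step.
corp <- tm_map(corp, content_transformer(complete_doc), dictionary = dict)

dtm <- DocumentTermMatrix(corp, control = list(weighting = weightTf))
inspect(dtm)
```

Replacing `DocumentTermMatrix` with `TermDocumentMatrix` gives the transposed matrix with the same terms.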
  • Also, I want a DTM, not a TDM; these can be interchanged, right? – Hardik Gupta Jan 16 '17 at 11:59
  • No it's "correct" (depending on what you assume to be correct; it's just trying to match things), and yes you can exchange it. – lukeA Jan 16 '17 at 12:07
  • Look at the first output, `appli`: the stem and the stem completion output are both the same – Hardik Gupta Jan 16 '17 at 12:14
  • Technically, they are not the same: one is `"appli"` and the other is `""`. Anyway, R has wonderful documentation. Have you even considered looking into it? `?stemCompletion` says the function tries to _"**Heuristically** complete stemmed words."_ You may also want to read the paper referenced there to explore the problems around stemming inversion and possible approaches: https://pdfs.semanticscholar.org/7ef5/4d37940617617c745e8bb9758d06a3e37231.pdf . – lukeA Jan 16 '17 at 12:19
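
The empty completion for `appli` can be illustrated with a simplified stand-in for prefix-based completion. This is a base R sketch, not tm's actual implementation: `complete_stem` is a hypothetical helper that returns the alphabetically first dictionary word starting with the stem, or `""` when there is none.

```r
# Hypothetical helper: complete a stem by prefix matching against a dictionary.
complete_stem <- function(stem, dictionary) {
  candidates <- dictionary[startsWith(dictionary, stem)]
  if (length(candidates) == 0) return("")  # nothing starts with this stem
  sort(candidates)[1]                      # pick the alphabetically first match
}

dictionary <- c("apply", "associate", "association", "lecture", "lecturer")
result <- sapply(c("appli", "associ", "lectur"), complete_stem,
                 dictionary = dictionary)
print(result)
# appli -> "", associ -> "associate", lectur -> "lecture"
```

No dictionary word begins with `appli` (`apply` ends in `y`), so a purely prefix-based heuristic has nothing to match for that stem.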