Stem completion in R replaces names, not data

Question

My team is doing some topic modeling on medium-sized chunks of text (tens of thousands of words), using the quanteda package in R. I'd like to reduce words to word stems before the topic modeling process, so that variations on the same word aren't counted as separate terms.

The only problem is that the stemming algorithm leaves behind some stems that aren't really words. "Happiness" stems to "happi," "arrange" stems to "arrang," and so on. So, before I visualize the results of the topic modeling, I'd like to restore the stems to complete words.

By reading through some previous threads here on Stack Overflow, I came across a function, stemCompletion(), from the tm package, that does this, at least approximately. It seems to work reasonably well.

But when I apply it to the terms vector within a document-term matrix, stemCompletion() always seems to replace the names of the character vector, not the elements themselves. Here's a reproducible example:

# Set up libraries
library(janeaustenr)
library(quanteda)
library(tm)

# Get first 200 words of Mansfield Park
words <- head(mansfieldpark, 200)

# Build a corpus from words
corpus <- quanteda::corpus(words)

# Eliminate some words from counting process
STOPWORDS <- c("the", "and", "a", "an")

# Create a document-term matrix in the format used by topicmodels
dtm <- corpus %>% 
    quanteda::dfm(remove_punct = TRUE,
                  remove = STOPWORDS) %>%
    quanteda::dfm_wordstem(.) %>% # Word stemming takes place here
    quanteda::convert("topicmodels")

# Word stems are now stored in dtm$dimnames$Terms

# View a sample of stemmed terms
tail(dtm$dimnames$Terms, 20)

# View the structure of dtm$dimnames$Terms (It's just a character vector)
str(dtm$dimnames$Terms)

# Apply tm::stemCompletion to Terms
unstemmed_terms <-
    tm::stemCompletion(dtm$dimnames$Terms, 
                       dictionary = words, # or corpus
                       type = "shortest")

# Result is composed entirely of NAs, with the values stored as names!
str(unstemmed_terms)

tail(unstemmed_terms, 20)

I'm looking for a way to get the results returned by stemCompletion() into a character vector, and not into the names attribute of a character vector. Any insights into this issue are much appreciated.

I haven't used stemCompletion before but unless I missed something, `names(x)` is a character vector containing the names of `x`? Which sounds like what you are looking for? Like this: `unstemmed_terms <- names(unstemmed_terms)` — Calum You, Apr 04 '18 at 22:38
That's simple, but it works. Thanks! I'm still curious as to _why_ stemCompletion stores its result in the names attribute, but this is helpful. — J. Trimarco, Apr 04 '18 at 22:56
`stemCompletion()` does not store the result in the names attribute of the returned character vector. Rather, the names are the stems that were passed in, and the vector elements are the stem-completed terms. — Ken Benoit, Apr 05 '18 at 07:23
Also your loading of the **tidyverse** package is unnecessary for this example. — Ken Benoit, Apr 05 '18 at 07:23
@KenBenoit Thanks for pointing out that the returned vector elements are the stem-completed terms and the names are the input stems. The documentation only says this function returns "A character vector with completed words." Very helpful to understand what's really going on. — J. Trimarco, Apr 05 '18 at 23:17
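
To make the names-versus-elements distinction concrete, here is a minimal base-R sketch. It does not call stemCompletion(); the vector is built by hand, with three stem/completion pairs copied from the answer's output further down the thread, so it only illustrates the shape of the return value.

```r
# Hand-built stand-in for a stemCompletion() result (not produced by tm here):
# names are the input stems, elements are the completions, NA where none was found.
completed <- c(arrang = "arranging", chariti = NA, happi = "happily")

names(completed)    # the stems that were passed in
unname(completed)   # the completed words as a plain, unnamed character vector
```

When every element is NA, str() and tail() show only NAs under meaningful names, which makes it easy to conclude that the results ended up in the names attribute.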

score 4 · Accepted Answer · answered Apr 05 '18 at 07:20

The problem is that your dictionary argument to tm::stemCompletion() is not a character vector of words (or a tm Corpus object), but rather a set of lines from the Austen novel.

tail(words)
# [1] "most liberal-minded sister and aunt in the world."                        
# [2] ""                                                                         
# [3] "When the subject was brought forward again, her views were more fully"    
# [4] "explained; and, in reply to Lady Bertram's calm inquiry of \"Where shall" 
# [5] "the child come to first, sister, to you or to us?\" Sir Thomas heard with"
# [6] "some surprise that it would be totally out of Mrs. Norris's power to"
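
A rough base-R sketch of why whole lines make a poor dictionary. It assumes a completer that matches each stem against the start of each dictionary entry; that mechanism is an assumption, not something stated in this thread, and `complete_shortest()` is a hypothetical helper, not tm's code. The two dictionary lines are shortened versions of the lines shown above.

```r
# Hypothetical helper: return the shortest dictionary entry that starts with each stem.
complete_shortest <- function(stems, dictionary) {
  vapply(stems, function(s) {
    hits <- dictionary[startsWith(dictionary, s)]
    if (length(hits) == 0) NA_character_ else hits[which.min(nchar(hits))]
  }, character(1))
}

# Dictionary as lines of text versus the same text split into words
lines_dict <- c("explained; and, in reply to Lady Bertram's calm inquiry",
                "some surprise that it would be totally out of power")
word_dict <- unlist(strsplit(lines_dict, "[^A-Za-z']+"))

stems <- c("surpris", "total", "calm")
from_lines <- complete_shortest(stems, lines_dict)  # no line starts with these stems
from_words <- complete_shortest(stems, word_dict)   # each stem finds a word
```

Under that assumption, a stem can only match a line that happens to begin with it, so most stems come back as NA until the dictionary is split into individual words.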

This is easily fixed by tokenising the lines with quanteda's tokens() and converting the result to a character vector.

unstemmed_terms <-
    tm::stemCompletion(dtm$dimnames$Terms, 
                       dictionary = as.character(tokens(words, remove_punct = TRUE)), 
                       type = "shortest")

tail(unstemmed_terms, 20)
#      arrang          chariti           perhap         parsonag          convers            happi 
# "arranging"               NA        "perhaps"               NA   "conversation"        "happily" 
#      belief             most     liberal-mind             aunt            again             view 
#    "belief"           "most" "liberal-minded"           "aunt"          "again"          "views" 
#     explain             calm          inquiri            where             come            heard 
# "explained"           "calm"               NA               NA           "come"          "heard" 
#     surpris            total 
#  "surprise"        "totally"
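
Some stems still come back as NA because the dictionary holds no word that completes them. One possible way to handle that, sketched in base R with four values copied from the output above, is to fall back to the stem itself wherever the completion is missing. The name `restored` is illustrative only.

```r
# Four stem/completion pairs copied from the output above
completed <- c(arrang = "arranging", chariti = NA, perhap = "perhaps", parsonag = NA)

# Keep the completion where there is one, otherwise fall back to the stem
restored <- unname(ifelse(is.na(completed), names(completed), completed))
restored
```

A vector built this way has one entry per stem, in the original order, so it could in principle be assigned back to dtm$dimnames$Terms. Distinct stems can complete to the same word, though, so check for duplicates first.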

Thanks for this helpful solution. I had tried tokens() but hadn't thought to convert to character as well. — J. Trimarco, Apr 05 '18 at 23:39
