
I am struggling through some text analysis, and I'm not sure I'm doing the stemming correctly. Right now, my command for single-term stemming is

text_stem <- text_clean %>% mutate(stem = wordStem(word, language = "english"))

Is it possible to use this not only as a stemmer but also as a filter? For example, if "text_clean" contains the word aksdjhgla, and that word is not in whatever SnowballC uses as a dictionary, would the stemmer reject it? Or is there another command that does this kind of filtering?

Karl Wolfschtagg
  • There are many different things that could be causing the problems you are having. I would encourage including the actual error you're receiving and making your code reproducible. Refs: [stackoverflow.com/q/5963269](https://stackoverflow.com/q/5963269), [minimal reproducible example](https://stackoverflow.com/help/minimal-reproducible-example), and [stackoverflow.com/tags/r/info](https://stackoverflow.com/tags/r/info) – Kat Jan 09 '22 at 22:50
  • @Kat, I'm not getting an error. The code works as it should (I think). When I look at the top n stems, I'm wondering if I can use the stemmer to reject "garbage" (or misspelled) words. – Karl Wolfschtagg Jan 09 '22 at 23:41
  • How about this resource? [This is a link to the `hunspell` package content explanation.](https://cran.r-project.org/web/packages/hunspell/vignettes/intro.html) – Kat Jan 10 '22 at 02:02

1 Answer


wordStem does not employ a dictionary but uses grammatical rules to do stemming (which is a rather crude approximation to lemmatisation btw). Here is an example:

words <- c("win", "winning")
words2 <- c("aksdjhglain", "aksdjhglainning")

SnowballC::wordStem(words, language = "english")
#> [1] "win"    "win"
SnowballC::wordStem(words2, language = "english")
#> [1] "aksdjhglain"  "aksdjhglain"

As you can see, wordStem does exactly the same thing whether the words actually exist or are complete rubbish. All that matters are the word endings (i.e., the suffixes). As @Kat suggested, you probably want to look at the hunspell package, which actually uses dictionaries. To find out which words exist in the dictionary, use hunspell_check:

hunspell::hunspell_check(c(words, words2))
#> [1]  TRUE  TRUE FALSE FALSE

Inside your existing code, you could use this to remove misspelled words:

text_stem <- text_clean %>%
  mutate(stem = wordStem(word, language = "english")) %>%
  filter(hunspell::hunspell_check(word, dict = hunspell::dictionary("en_US")))
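
For anyone who wants to try this without the original data, here is a minimal self-contained sketch. The `text_clean` tibble below is a hypothetical stand-in (one token per row in a `word` column, built from the example words above), not the asker's real data. It also filters before stemming, which is just one possible ordering; the result is the same because the check runs on the unstemmed word either way.

```r
library(dplyr)
library(SnowballC)
library(hunspell)

# Hypothetical stand-in for text_clean: one token per row in a `word` column
text_clean <- tibble(
  word = c("win", "winning", "aksdjhglain", "aksdjhglainning")
)

text_stem <- text_clean %>%
  # keep only words that the en_US dictionary recognises
  filter(hunspell_check(word, dict = dictionary("en_US"))) %>%
  # stem the surviving words with the rule-based Snowball stemmer
  mutate(stem = wordStem(word, language = "english"))

print(text_stem)
```

With these four words, only "win" and "winning" should survive the filter. Keep in mind that a dictionary check may also drop legitimate tokens such as domain-specific terms or names that are not in the dictionary, so it is worth inspecting what gets removed before relying on it.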
JBGruber