stemming ngrams in tidyr

Question

I am trying to create bigrams with both words stemmed, but my code only stems the second word and leaves the first word unstemmed. So, for example, "worrying about" and "worry about" are listed separately.

Any assistance would be appreciated.

 bigram_text <- text_df %>% 
   mutate_all(as.character) %>%
   unnest_tokens(bigram, text, token = "ngrams", n = 2)%>% 
   mutate(bigram = wordStem(bigram))

 bigramcount<- bigram_text %>%
   count(bigram, sort = TRUE)

It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Be sure to explicitly list all the packages you are using so it's clear where each function comes from. — MrFlick, May 18 '20 at 04:59

score 0 · Accepted Answer · answered May 18 '20 at 10:28

The problem is that wordStem, like many other stemmers, stems single words: it treats each string it receives as one word. A bigram is two words, so only the end of the string (the second word) gets stemmed. What you need is a function that stems every word in a string. In this case you can use stem_strings from the textstem package.

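 library(dplyr)      # %>%, mutate_all(), mutate()
 library(tidytext)   # unnest_tokens()
 library(textstem)   # stem_strings()

 # Tokenize into bigrams, then stem every word inside each bigram
 bigram_text <- text_df %>%
   mutate_all(as.character) %>%
   unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
   mutate(bigram = stem_strings(bigram))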
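
As a quick check of the difference, here is a minimal sketch that runs both functions on the two phrases from the question. The `phrases` vector is illustrative only, and the sketch assumes the SnowballC and textstem packages are installed.

```r
library(SnowballC)  # wordStem()
library(textstem)   # stem_strings()

# Illustrative input: the two phrases mentioned in the question
phrases <- c("worrying about", "worry about")

# wordStem() treats each element as one word, so the first word is left alone
whole <- wordStem(phrases)

# stem_strings() stems each word inside the string
per_word <- stem_strings(phrases)

# Number of distinct values produced by each approach
length(unique(whole))
length(unique(per_word))
```

If the two phrases collapse to a single value only in `per_word`, the bigram counts will merge as intended.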

Of course a more roundabout way would be to split the bigram into 2 columns, stem the columns and then paste them back together.
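
A minimal sketch of that split-and-paste approach, assuming tidyr's separate() and unite() plus SnowballC's wordStem(). The sample `text_df` and the column names `word1` and `word2` are hypothetical, chosen only for illustration.

```r
library(dplyr)      # %>%, mutate(), count(), tibble()
library(tidyr)      # separate(), unite()
library(tidytext)   # unnest_tokens()
library(SnowballC)  # wordStem()

# Hypothetical sample data standing in for the asker's text_df
text_df <- tibble(text = c("worrying about", "worry about"))

bigram_text <- text_df %>%
  unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
  # Split each bigram into its two words
  separate(bigram, into = c("word1", "word2"), sep = " ") %>%
  # Stem each word on its own
  mutate(word1 = wordStem(word1),
         word2 = wordStem(word2)) %>%
  # Paste the stemmed words back into one bigram column
  unite(bigram, word1, word2, sep = " ")

bigramcount <- bigram_text %>%
  count(bigram, sort = TRUE)
```

This is more verbose than stem_strings, but it keeps the individual words available in case you want to filter stop words before reuniting them.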
