
Tokenization of the data

library(tidytext)
library(dplyr)

# Split q_content into one token (word) per row
tidy_text <- data %>% 
  unnest_tokens(word, q_content)

Removal of stop words

data("stop_words")
stop_words
tidy_text <- tidy_text %>% anti_join(stop_words, by = "word")
tidy_text %>% count(word, sort = TRUE)

Output showing the 10 most frequent words

      word     n
1       im 13012
2     dont 11197
3     feel  9168
4     time  6697
5     life  4464
6      ive  4403
7   people  4233
8     told  4150
9  friends  4045
10    love  3281
  • It's easier to help you if you include a simple [reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with sample input and desired output that can be used to test and verify possible solutions. Which words do you expect to be removed? – MrFlick Apr 16 '21 at 00:57
  • I'm not sure what you're expecting @ScotGarrison. Have you taken a look at `stop_words`? From the 10 words you list, `stop_words` contains `"i'm"`, `"don't"`, `"i've"`. Since you do an exact anti-join and in your word list these stop words are misspelled, they don't get filtered out. So your options are to either add these misspelled words to the list of stop words, or do a fuzzy anti join (e.g. using functions from the `fuzzyjoin` package). – Maurits Evers Apr 16 '21 at 06:52
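
A rough sketch of the fuzzy anti-join option mentioned in the comment above, assuming the `fuzzyjoin` package is installed; `stringdist_anti_join()` and `max_dist = 1` are illustrative choices, and distance-based matching can also drop legitimate words that merely look similar to a stop word:

library(dplyr)
library(tidytext)
library(fuzzyjoin)

# Remove tokens within one edit of a stop word, e.g. "im" vs "i'm", "dont" vs "don't"
tidy_text_fuzzy <- tidy_text %>% 
  stringdist_anti_join(stop_words, by = "word", max_dist = 1)

tidy_text_fuzzy %>% count(word, sort = TRUE)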

1 Answer


As explained by @Maurits Evers, the words in your data and `stop_words` do not match exactly. You may remove the apostrophe (`'`) from the words in `stop_words` before joining them. Try:

library(dplyr)

tidy_text <- tidy_text %>% 
  anti_join(stop_words %>% 
              mutate(word = gsub("'", "", word)), 
            by = "word")

tidy_text %>% count(word, sort = TRUE)
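
The other option from the comments, appending the apostrophe-free variants to the stop word list itself, could look something like this (`custom_stop_words` is just an illustrative name):

library(dplyr)
library(tidytext)

data("stop_words")

# Keep the original stop words and add "im", "dont", "ive", ... alongside "i'm", "don't", "i've", ...
custom_stop_words <- stop_words %>% 
  mutate(word = gsub("'", "", word)) %>% 
  bind_rows(stop_words) %>% 
  distinct(word, .keep_all = TRUE)

tidy_text <- tidy_text %>% 
  anti_join(custom_stop_words, by = "word")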
Ronak Shah