Removing stop words with tidytext

Question

Using tidytext, I have this code:

data(stop_words)
tidy_documents <- tidy_documents %>%
      anti_join(stop_words)

I want to use the stop words built into the package to overwrite the data frame tidy_documents with a copy of itself from which every word that appears in stop_words has been removed.

I get this error:

Error: No common variables. Please specify `by` param.

Traceback:

1. tidy_documents %>% anti_join(stop_words)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(expr, envir, enclos)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. anti_join(., stop_words)
10. anti_join.tbl_df(., stop_words)
11. common_by(by, x, y)
12. stop("No common variables. Please specify `by` param.", call. = FALSE)

Clearly `tidy_documents` and `stop_words` don't share any variable names, so you'll need to match the two dataset using the `by` parameter. — Axeman, Apr 16 '17 at 21:16
The column of `stop_words` is called `word`, so either name your column that or use the `by` parameter of `anti_join`. — alistaire, Apr 16 '17 at 21:17
What are the column names in `tidy_documents`? We can tell you specifically how to set up the join if you share that. — Julia Silge, Apr 17 '17 at 02:32
@JuliaSilge The columns in `tidy_documents` are `author`, `date`, and `word`. — Simon Lindgren, Apr 17 '17 at 05:55
@textnet Hmmmmm, that seems odd then. If you have a `word` column in your main dataset, I would expect `anti_join()` would know to match it up with the `word` column in the `stop_words` dataset. Can you try to [make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with data? — Julia Silge, Apr 17 '17 at 15:54
@JuliaSilge Thanks, but I think I got it to work like this: `data(stop_words); tidy_base <- anti_join(tidy_base, stop_words, by = "word")`. Does that seem reasonable? — Simon Lindgren, Apr 17 '17 at 19:43

score 14 · Answer 1 · edited Oct 19 '17 at 00:27

You can use the simpler filter() instead of anti_join(), which sidesteps the question of join columns entirely:

tidy_documents <- tidy_documents %>%
  filter(!word %in% stop_words$word)
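
As a minimal sketch of why the two approaches are interchangeable here, the snippet below runs both on a small made-up table. The toy rows are invented for illustration; only the column names (author, date, word) come from the question's comments.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Toy data, invented for illustration; column names follow the question.
tidy_documents <- tibble(
  author = c("a", "a", "a", "b", "b"),
  date   = as.Date("2017-04-16"),
  word   = c("the", "whale", "of", "and", "harpoon")
)

# Approach 1: filtering join, with the key named explicitly.
via_join <- tidy_documents %>%
  anti_join(stop_words, by = "word")

# Approach 2: plain filter against the stop word vector.
via_filter <- tidy_documents %>%
  filter(!word %in% stop_words$word)

# Both should keep the same words in the same order.
identical(via_join$word, via_filter$word)
```

Note that stop_words also has a lexicon column, so the same word can appear in it more than once. Neither approach duplicates rows because of that, since both only test for membership.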

edited Oct 19 '17 at 00:27

Axel


answered Oct 19 '17 at 00:09

Rohit


score 13 · Accepted Answer · answered May 14 '17 at 22:24

Both tidy_documents and stop_words have a column named word. dplyr joins match columns by name rather than by position, so the differing column order (word comes first in stop_words but not in your dataset) should not by itself prevent a match; the error means that anti_join() found no shared column name when it ran. Naming the join column explicitly avoids relying on that detection. Try this:

tidy_documents <- tidy_documents %>%
      anti_join(stop_words, by = "word")

The by argument tells anti_join() to compare the columns named word, regardless of their position.
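
If the word column in your data has a different name, as the comments on the question suggest can happen, a named vector passed to by maps one column onto the other. The following is a small sketch under that assumption; the table tokens and its column token are hypothetical names chosen for illustration.

```r
library(dplyr)
library(tidytext)

data(stop_words)

# Hypothetical data whose word column is called `token`, not `word`.
tokens <- tibble(
  doc   = c(1, 1, 2),
  token = c("the", "harpoon", "and")
)

# Without `by`, the two tables share no column name, so dplyr should
# stop with an error; try() captures it instead of halting the script.
no_by <- try(anti_join(tokens, stop_words), silent = TRUE)
inherits(no_by, "try-error")

# Named vector: left-hand column name = right-hand column name.
cleaned <- anti_join(tokens, stop_words, by = c("token" = "word"))
cleaned
```

Renaming the column to word before the join, for example with rename(word = token), would be an equivalent alternative.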
