Using tidytext, I have this code:

data(stop_words)
tidy_documents <- tidy_documents %>%
      anti_join(stop_words)

I want to use the stop words built into the package to overwrite the dataframe tidy_documents with a version of itself in which every word that appears in stop_words has been removed.

I get this error:

Error: No common variables. Please specify `by` param.
Traceback:

1. tidy_documents %>% anti_join(stop_words)
2. withVisible(eval(quote(`_fseq`(`_lhs`)), env, env))
3. eval(quote(`_fseq`(`_lhs`)), env, env)
4. eval(expr, envir, enclos)
5. `_fseq`(`_lhs`)
6. freduce(value, `_function_list`)
7. withVisible(function_list[[k]](value))
8. function_list[[k]](value)
9. anti_join(., stop_words)
10. anti_join.tbl_df(., stop_words)
11. common_by(by, x, y)
12. stop("No common variables. Please specify `by` param.", call. = FALSE)
Axeman
Simon Lindgren
  • Clearly `tidy_documents` and `stop_words` don't share any variable names, so you'll need to match the two datasets using the `by` parameter. – Axeman Apr 16 '17 at 21:16
  • The column of `stop_words` is called `word`, so either name your column that or use the `by` parameter of `anti_join`. – alistaire Apr 16 '17 at 21:17
  • What are the column names in `tidy_documents`? We can tell you specifically how to set up the join if you share that. – Julia Silge Apr 17 '17 at 02:32
  • @JuliaSilge Columns in `tidy_documents` are `author`, `date`, `word`. – Simon Lindgren Apr 17 '17 at 05:55
  • @textnet Hmmmmm, that seems odd then. If you have a `word` column in your main dataset, I would expect `anti_join()` to know to match it up with the `word` column in the `stop_words` dataset. Can you try to [make a reproducible example](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) with data? – Julia Silge Apr 17 '17 at 15:54
  • @JuliaSilge Thanks, but I think I got it to work. Like this: `data(stop_words)` followed by `tidy_base <- anti_join(tidy_base, stop_words, by="word")`. Seems reasonable? – Simon Lindgren Apr 17 '17 at 19:43

2 Answers

You can use the simpler filter() to avoid using the confusing anti_join() function like this:

tidy_documents <- tidy_documents %>%
  filter(!word %in% stop_words$word)
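
To see the filter in action without the original data, here is a minimal sketch. The toy `tidy_documents` uses the column names reported in the comments (`author`, `date`, `word`), and `stop_words_demo` is a hypothetical stand-in for tidytext's `stop_words` with the same column names (`word`, `lexicon`); the values themselves are invented for illustration.

```r
library(dplyr)

# Hypothetical toy data; column names follow the question's comments
tidy_documents <- tibble(
  author = c("a", "a", "b", "b"),
  date   = as.Date(c("2017-04-16", "2017-04-16", "2017-04-17", "2017-04-17")),
  word   = c("the", "network", "of", "text")
)

# Hypothetical stand-in for tidytext::stop_words (same column names)
stop_words_demo <- tibble(
  word    = c("the", "of", "and"),
  lexicon = "demo"
)

# Keep only the rows whose word is not in the stop-word list
filtered <- tidy_documents %>%
  filter(!word %in% stop_words_demo$word)

filtered$word
```

With the real data, replacing `stop_words_demo` by `stop_words` should behave the same way, as long as the word column is really called `word`.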
Axel
Rohit

Both tidy_documents and stop_words keep their words in a column named word, although at different positions: in stop_words it is the first column, while in your dataset it comes after other columns. Position by itself should not matter, because dplyr joins match columns by name; the error means anti_join() found no shared column names when it ran. Naming the key explicitly removes any ambiguity. Try this:

tidy_documents <- tidy_documents %>% 
      anti_join(stop_words, by = c("word" = "word"))

The by argument tells anti_join() to compare the columns called word, regardless of their position.
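
When the error appears, a quick check is to compare the column names of the two tables, since that is what a join without `by` relies on. The sketch below is an illustration under assumptions: `docs` is a hypothetical table whose word column is named `token`, and `stops` is a small stand-in for `stop_words`. It shows the named form of `by`, which maps a column on the left to a differently named column on the right.

```r
library(dplyr)

# Hypothetical table whose word column is NOT called "word"
docs <- tibble(
  author = c("a", "a", "b"),
  token  = c("the", "network", "of")
)

# Hypothetical stand-in for tidytext::stop_words (same column names)
stops <- tibble(
  word    = c("the", "of"),
  lexicon = "demo"
)

# Names shared by both tables; an empty result means a join
# without `by` has nothing to match on
shared <- intersect(names(docs), names(stops))

# Map the left-hand column "token" to the right-hand column "word"
cleaned <- docs %>%
  anti_join(stops, by = c("token" = "word"))

cleaned$token
```

If `shared` already contains `"word"`, then `by = "word"` is enough and is equivalent to `by = c("word" = "word")`.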

Vale Baia