0

I have two dataframes and would like to filter the animals column in table1 by the animals column in table2, while retaining the multiple categories of animals in table1 for cat and dog. The end result should appear the same as table1, but with "lion" removed. There should still be two "cat" and two "dog".

I'm not sure how to approach this as an r newbie. I have a feeling the answer involves the match function or some type of join? I would prefer a dplyr or reshape2 method if possible, especially if there's a way to use a dplyr join function. I'm also not very experienced with the merge base function.

Here is the code for the two data frames:

table1 <- data.frame(id=c(1:7), animal=c("cat","cat","dog","dog","parakeet","lion","duck"))

table2 <- data.frame(id=c(1:4), animal=c("cat","dog","parakeet","duck"))
alistaire
  • 42,459
  • 4
  • 77
  • 117
Mike
  • 2,017
  • 6
  • 26
  • 53

3 Answers3

4

you can simply use %in% like this :

table1[table1$animal %in% table2$animal,]

  id   animal
1  1      cat
2  2      cat
3  3      dog
4  4      dog
5  5 parakeet
7  7     duck
HubertL
  • 19,246
  • 3
  • 32
  • 51
1

Using data.table

library(data.table)
setDT(table1)[table2[-1], on = "animal"]
#   id   animal
#1:  1      cat
#2:  2      cat
#3:  3      dog
#4:  4      dog
#5:  5 parakeet
#6:  7     duck
akrun
  • 874,273
  • 37
  • 540
  • 662
0

You could do this with semi_join in dplyr, which will

      return all rows from ‘x’ where there are matching
      values in ‘y’, keeping just columns from ‘x’.

      A semi join differs from an inner join because an inner join
      will return one row of ‘x’ for each matching row of ‘y’,
      where a semi join will never duplicate rows of ‘x’.

But first, convert your data so the columns that look like strings (but are actually factors) are actually strings. You can do this by table1[] <- lapply(table1, as.character) and table2[] <- lapply(table2, as.character). Or, when you create a data.frame, by

table1 <- data.frame(id=c(1:7), animal=c("cat","cat","dog","dog","parakeet","lion","duck"),
                     stringsAsFactors=FALSE)
table2 <- data.frame(id=c(1:4), animal=c("cat","dog","parakeet","duck"),
                     stringsAsFactors=FALSE)

Then, you can do

library(dplyr)
semi_join(table1, table2, by = "animal")

giving

  id   animal
1  1      cat
2  2      cat
3  3      dog
4  4      dog
5  5 parakeet
6  7     duck

If you don't do this (i.e., if you do join on a factor) the code will give a warning because table1 and table2 have factors, not strings. This warning shouldn't be ignored because on some versions of dplyr the coercion to characters is inconsistent. Before using *join functions in dplyr, convert your data.frame factor columns to characters.

ps

You could also use %in within filter to give the same result table1 %>% filter(animal %in% table2$animal)

jaimedash
  • 2,683
  • 17
  • 30
  • `semi_join` gives me inconsistent results here (i.e. actually a different result when running the code a second time), but never the desired ones that it seems like it should...buggy, it seems. – alistaire Apr 14 '16 at 21:42
  • Inconsistent between invocations on the same data or ...? – jaimedash Apr 14 '16 at 21:44
  • Yep. I updated a couple other packages in between, but that's the only change. There's some sort of issue here. – alistaire Apr 14 '16 at 21:45
  • Hmm, looks like inconsistent factor leveling. Converting to factor manually before the join fixes the issue...but take heed of that warning. – alistaire Apr 14 '16 at 21:47
  • Say, if you do have a reproducible example of inconsistent results, possibly worth filing an issue. It seems dplyr folks are trying to get consistent behavior despite the warning https://github.com/hadley/dplyr/issues/594 – jaimedash Apr 14 '16 at 22:20
  • Thanks, I ended up using the dplyr filter() method. I don't fully understand the issue with the semi_join. I had actually tried that before asking the question and it was not giving me what I wanted. It was only outputting "cat","cat","dog","dog". So what exactly should I do to make it work? Do I convert the factors to strings? – Mike Apr 15 '16 at 14:30
  • Yes, if you create (or read in using `read.table`) the data.frame, pass the argument `stringsAsFactors=FALSE` as in the answer or [follow this anwer](https://stackoverflow.com/questions/2851015/convert-data-frame-columns-from-factors-to-characters?rq=1) to change an existing column. Also, the inconsistency you experienced is a maybe a bug. Another commenter filed this issue: https://github.com/hadley/dplyr/issues/1773 – jaimedash Apr 15 '16 at 15:19