How to subset in R for this particular condition?

Question

df1 and df2 both have columns a and b. I want to subset df1 so that only the rows whose (a, b) pair also appears as an (a, b) pair in the same row of df2 are kept.

df1
a   b  c
1   m  df1
2   f  df1
3   f  df1
4   m  df1
5   f  df1
6   m  df1

df2
a   b  c
1   m  df2
3   f  df2
4   f  df2
5   m  df2
6   f  df2
7   m  df2

desired output

df
a   b  c
1   m  df1
3   f  df1

I am using:

df <- subset(df1,(df1$a%in%df2$a & df1$b%in%df2$b))

but this gives the same result as testing column a alone:

df <-subset(df1,df1$a%in%df2$a)

I have changed the question. Please read it again, and this method is also giving the same result as one condition. — vk087, Feb 05 '15 at 13:10
So maybe `df1[(df1$a %in% df2$a) & (df1$b %in% df2$b), ]` then? — David Arenburg, Feb 05 '15 at 13:13
Please add a reproducible example, containing the output you get and the output you expect. Please see http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example for how to make a good reproducible example. — Rainer, Feb 05 '15 at 13:13
No David, it yields the same result as df <-subset(df1,df1$a%in%df2$a). However, I have changed my question once again, as I was also confused about it. It now gives a clearer picture of the problem. — vk087, Feb 05 '15 at 13:19
You can't just edit the question each time you are getting a working solution. — David Arenburg, Feb 05 '15 at 13:50
@DavidArenburg I am sorry, I am new here. I am still learning how to phrase a question. Anyway, lesson learnt. I will try to avoid these silly mistakes. — vk087, Feb 05 '15 at 13:51

Cath · Accepted Answer · 2015-02-05T14:06:31.630

4

You can use the dplyr package:

library(dplyr)
intersect(df1,df2)
#  a b
#1 1 m
#2 3 f

Edit for the new data.frames with a c column: you can use the semi_join function (also from dplyr), which keeps the rows of df1 that have a match in df2 and returns only df1's columns:

semi_join(df1,df2,by=c("a","b"))
#  a b   c
#1 1 m df1
#2 3 f df1

Another option, in base R:
you can paste the a and b columns together and use the combined key to subset your data.frame:

df1[paste(df1$a,df1$b) %in% paste(df2$a,df2$b), ]
#  a b
#1 1 m
#3 3 f

and with the new data.frames:

#  a b   c
#1 1 m df1
#3 3 f df1
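
To illustrate why the two separate `%in%` tests from the question are not enough, here is a minimal self-contained sketch that rebuilds the example data and compares both filters. The explicit `sep = "\r"` is an assumption added here (the code above uses the default space); it is only meant to reduce the chance that two different (a, b) pairs paste into the same key.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = 1:6, b = c("m", "f", "f", "m", "f", "m"),
                  c = "df1", stringsAsFactors = FALSE)
df2 <- data.frame(a = c(1L, 3:7), b = c("m", "f", "f", "m", "f", "m"),
                  c = "df2", stringsAsFactors = FALSE)

# Column-wise test: every b value ("m" or "f") occurs somewhere in df2$b,
# so the second condition is always TRUE and only the test on a matters
loose <- df1[df1$a %in% df2$a & df1$b %in% df2$b, ]

# Row-wise test: build one key per row and compare whole (a, b) pairs.
# sep = "\r" is an assumed separator that should not occur in the data
key1 <- paste(df1$a, df1$b, sep = "\r")
key2 <- paste(df2$a, df2$b, sep = "\r")
strict <- df1[key1 %in% key2, ]

loose$a   # rows matched on a alone
strict$a  # rows matched on the (a, b) pair
```

In this data `loose` keeps every row whose a value occurs in df2, while `strict` keeps only the rows where a and b match together.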

edited Feb 05 '15 at 14:06

answered Feb 05 '15 at 13:37

Cath


I would rather not use paste, as it increases the run time. Any other method? – vk087 Feb 05 '15 at 13:38
@VaibhavKaushal yes, David's one ;-) or with package `dplyr`, see my edit – Cath Feb 05 '15 at 13:39
I was looking at base R's intersect, but the dplyr overload is nice :) – Colonel Beauvel Feb 05 '15 at 13:44
@ColonelBeauvel, yes I find `dplyr` `setdiff` and `intersect` functions much better (intuitive...) than `base` R ones for data.frames – Cath Feb 05 '15 at 13:45
Oops, it looks like I am not good at framing questions. One final edit to the question, @CathG can you help? – vk087 Feb 05 '15 at 13:47
@VaibhavKaushal, that's why it is better to post a reproducible example from the start!... so my `base` R sol is back in the game ;-) – Cath Feb 05 '15 at 13:48
@CathG Yep, I can use it. Thanks for the input dude, but I am looking for anything other than paste. It increases the run time considerably on my original data :( – vk087 Feb 05 '15 at 13:50
@DavidArenburg, thanks, it is indeed not that appropriate ;-) – Cath Feb 05 '15 at 14:01
@VaibhavKaushal, see my edit, you can go with `semi_join` function from `dplyr` – Cath Feb 05 '15 at 14:09

David Arenburg · Answer 2 · 2015-02-05T14:15:52.580

Or you could do

Res <- rbind(df1, df2) 
Res[duplicated(Res), ]
#   a b
# 7 1 m
# 8 3 f

Edit1: Per the edit, here's a similar data.table solution

library(data.table)
Res <- rbind(df1, df2)
setDT(Res)[duplicated(Res, by = c("a", "b"), fromLast = TRUE)]
#    a b   c
# 1: 1 m df1
# 2: 3 f df1
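
For comparison, the same idea can be sketched in base R without data.table. This is only an illustration under two stated assumptions: the (a, b) pairs are unique within df1, and the df1 rows come first in the `rbind`. The name `out` is introduced here for the example.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = 1:6, b = c("m", "f", "f", "m", "f", "m"),
                  c = "df1", stringsAsFactors = FALSE)
df2 <- data.frame(a = c(1L, 3:7), b = c("m", "f", "f", "m", "f", "m"),
                  c = "df2", stringsAsFactors = FALSE)

Res <- rbind(df1, df2)

# Look only at the key columns; fromLast = TRUE flags the earlier copy
# of each repeated (a, b) pair, which is the df1 row under the assumptions
dup <- duplicated(Res[c("a", "b")], fromLast = TRUE)

# Keep flagged rows that belong to df1 (the first nrow(df1) rows of Res)
out <- Res[dup & seq_len(nrow(Res)) <= nrow(df1), ]
out
```

If df1 itself can contain repeated (a, b) pairs, this sketch would flag them even without a match in df2, so a join-based approach is safer in that case.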

Edit2: I see that @CathG opened a join battlefront, so here's how we do it with data.table

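setkey(setDT(df1), a, b)  # convert to data.table by reference and key on a, b
setkey(setDT(df2), a, b)
df1[df2, nomatch = 0]     # inner join on the keys; unmatched df2 rows are dropped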
#    a b   c i.c
# 1: 1 m df1 df2
# 2: 3 f df1 df2
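
For readers without data.table, a roughly equivalent base-R sketch uses `merge()`. Passing only the unique key columns of df2 is a choice made here so that the result contains df1's columns only and rows are not multiplied when df2 repeats a pair; the name `out` is introduced for the example.

```r
# Rebuild the example data from the question
df1 <- data.frame(a = 1:6, b = c("m", "f", "f", "m", "f", "m"),
                  c = "df1", stringsAsFactors = FALSE)
df2 <- data.frame(a = c(1L, 3:7), b = c("m", "f", "f", "m", "f", "m"),
                  c = "df2", stringsAsFactors = FALSE)

# Inner join of df1 against the distinct (a, b) pairs of df2.
# Only df1 rows with a matching pair survive, and no i.c column is added
out <- merge(df1, unique(df2[c("a", "b")]), by = c("a", "b"))
out
```

Note that `merge()` sorts the result by the key columns by default, so the original row order of df1 is not necessarily preserved.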
