2

I think I ran into a situation that (to the best of my knowledge) is not fully covered by the awesome dplyr library, so I guess it will need a bit more of coding than what I am capable of. I have the following 2 data frames:

df1 =
   Col1 Col2 Col3
1    A    1    X
2    A    1    X
3    B    1    X
4    C    1    X
5    D    1    Y
6    D    1    Z

df2 =

  Col1 Col2 Col3
1    A    2    X
2    B    2    Y
3    C    2    Y
4    G    2    Z
5    H    2    X
6    I    2    Z

I want only the rows which has common elements from Col1 only, this is:

out =
  Col1 Col2 Col3
1    A    1    X
2    A    1    X
3    B    1    X
4    C    1    X
5    A    2    X
6    B    2    Y
7    C    2    Y

It looks like dplyr::intersect would do it, but since Col2 and Col3 have different values, it gives me table with 0 values. Your guidance is much appreciated. Thanks. P. Perez.

plperez
  • 35
  • 4

1 Answers1

1

With base R you could do:

common <- intersect(df1$Col1, df2$Col1)

df3 <- rbind(df1, df2)
df3[df3$Col1 %in% common, ]

which gives:

   Col1 Col2 Col3
1     A    1    X
2     A    1    X
3     B    1    X
4     C    1    X
11    A    2    X
21    B    2    Y
31    C    2    Y

And with dplyr:

bind_rows(df1, df2) %>% 
  filter(Col1 %in% intersect(df1$Col1, df2$Col1))

which will give you the same output. An alternative by @Frank from the comments:

bind_rows(df1, df2, .id = "id") %>% 
  group_by(Col1) %>% 
  filter(n_distinct(id) == 2L)

The logic behind this is that you bind the two dataframes together and include an id-column simultaniously with the .id-parameter. Then group by the values of Col1 and check how much unique id's there are for each value. The ones with only one unique id, don't appear in both dataframes.

A similar logic can be applied with the data.table package:

library(data.table)
rbindlist(list(df1, df2), idcol = 'id')[, if (uniqueN(id) == 2L) .SD, by = Col1]
Jaap
  • 81,064
  • 34
  • 182
  • 193
  • @Procrastinatus Maximus: the solution with base R is beautiful as simple. I need to take some time to digest the solution with `dplyr` though. Thanks! – plperez Nov 15 '16 at 17:21
  • @plperez added a short explanation – Jaap Nov 15 '16 at 17:25