merge data by groups and by common ID (IDs duplicated outside groups)

Question

This is not a duplicated question to How to join (merge) data frames. You can perform the left.merge inside the group but not to the whole data set. The ids are unique inside group, not acroos group. By not grouping and using a left.merge, you willl mess up the data.

I have a data with many groups (Panel data/Time seriers). Within the group, I want to merge the data by a common ID. And apply the same merge across all the groups that I have(same merge for all other groups).

#sample data
a<-data.frame(c(1:4,1:4),1,c('a','a','a','a','b','b','b','b'))
b<-data.frame(c(2,4,2,4),10,c('a','a','b','b'))

colnames(a)<-c('id','v','group')
colnames(b)<-c('id','v1','group')

> a
  id v group
1  1 1     a
2  2 1     a
3  3 1     a
4  4 1     a
5  1 1     b
6  2 1     b
7  3 1     b
8  4 1     b
> b
  id v1 group
1  2 10     a
2  4 10     a
3  2 10     b
4  4 10     b

I tried to use the dplyr group_by (group) and then merge(a,b,by='id',all.x=TRUE), but not sure how to apply dplyr to two data sets

desired output (left merge)

  id v group.x v1 group.y
  1  1       a NA    <NA>
  2  1       a 10       a
  3  1       a NA    <NA>
  4  1       a 10       a
  1  1       b NA    <NA>
  2  1       b 10       b
  3  1       b NA    <NA>
  4  1       b 10       b

Outside of dplyr you have the option `library(data.table); setDT(b)[a, on="id", mult="first"]` — Frank, Oct 17 '16 at 17:38
`a %>% left_join(b, by=c("id","group"))`. This gives you only one "group" column, but `v1` will be `NA` for cases where there's no matching row in `b`, so creating two separate "group" columns is redundant. Isn't that better, given that `group` (presumably) represents the same underlying division of the data in both data frames? — eipi10, Oct 17 '16 at 17:38
I think you'll have to make an example that covers such cases. It's not clear to me, anyways, what extension you're describing. — Frank, Oct 17 '16 at 17:50

score 2 · Accepted Answer · answered Oct 17 '16 at 17:48

2

You can just include group in the by argument for the join:

a %>% left_join(b, by=c("id","group"))

  id v group v1
1  1 1     a NA
2  2 1     a 10
3  3 1     a NA
4  4 1     a 10
5  1 1     b NA
6  2 1     b 10
7  3 1     b NA
8  4 1     b 10

This gives you only one "group" column, but v1 will be NA for cases where there's no matching row in b, so creating two separate "group" columns is redundant. Isn't that better, given that group (presumably) represents the same underlying division of the data in both data frames?

answered Oct 17 '16 at 17:48

eipi10

91,525
24
209
285

your codes work. It assumes the combination of 'id' by 'group' are unique, right? What if a small portion of 'id' by 'group ' are not unique. – MLE Oct 17 '16 at 17:50
Do you mean that there can be more than one row with same combination of `id` and `group` (but presumably different values of `v`)? As @Frank said, please provide more detail on what you have in mind; preferably by adding example data/code to your question. – eipi10 Oct 17 '16 at 17:52

merge data by groups and by common ID (IDs duplicated outside groups)

1 Answers1