R - merge by variable column with duplicated entry

Question

I am trying to merge two data of different size by ID. However, for the values that match, both data contain duplicated entries, i.e., there may be three ID #3 in Data A and three ID#3 in Data B. When I try to merge the data, the result is much larger than both data combined.

C<-merge(A,B,by="ID",all.x=T,sort=F)

I want to merge the two data by the ID column, such that the first ID #3 in B pairs with the first ID #3 in A, and so on.

Also, I want the row order of Data A to remain the same. The sort=FALSE wasn't much helpful: It places all the matching rows at the top, and the unmatched rows at the bottom.

Thanks for your help!

[How to make a reproducible example?](http://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) — Arun, Feb 02 '13 at 00:06

Josh O'Brien · Accepted Answer · 2013-02-02T01:20:20.677

5

Before merging, you'll need to add to each data.frame a column whose value records the index of each observation within its own ID group.

## Example data
A <- data.frame(ID=c(1,1,1,2), ht=1:4)
B <- data.frame(ID=c(1,1,2,2), wt=3:6)

## Add column with number of each observation within ID
A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))

## Now carry out the merge
merge(A, B, all.x=TRUE, sort=FALSE)
#   ID ID2 ht wt
# 1  1   1  1  3
# 2  1   2  2  4
# 3  2   1  4  5
# 4  1   3  3 NA

edited Feb 02 '13 at 01:20

answered Feb 02 '13 at 01:12

Josh O'Brien

159,210
26
366
455

The order of the rows are still not preserved, but it matches the duplicated entry as I wanted! Thanks! – shirleywu Feb 02 '13 at 07:57
@shirleywu -- To preserve the order of the rows, just leave off the `sort=FALSE` bit, like this: `merge(A, B, all.x=TRUE)`. – Josh O'Brien Feb 02 '13 at 15:38
Without the `sort=FALSE` bit, the merge automatically assumes `sort=TRUE` and sort the rows from smallest to largest number. I will try to figure this out ^^ – shirleywu Feb 05 '13 at 07:42
Not sure what order you actually want, but it sounds like this might be it: `merge(A[1:2], merge(A, B, all.x=TRUE, sort=FALSE))`. Does that look right? – Josh O'Brien Feb 05 '13 at 07:48
Using the logic of adding a column of index, I added a column of order to A and now the data sort according to A :) – shirleywu Feb 05 '13 at 08:51
@shirleywu -- Great. That was my other idea too. Out of curiosity, did you find any situations in which my last suggestion above *didn't* work just as well as adding an order column? Cheers. – Josh O'Brien Feb 05 '13 at 14:56
It orders the data in some strange way that I can't describe, and adds about 230 new rows. – shirleywu Feb 05 '13 at 19:50

score 2 · Answer 2 · answered Feb 05 '13 at 08:50

Thanks for your help, it is really useful. I end up adding a column of numbers to the larger data that I want to preserve order of.

Using @Josh O'Brien's example,

> ## Example data
> A <- data.frame(ID=c(1,1,1,2), ht=1:4)
> B <- data.frame(ID=c(1,1,2,2), wt=3:6)
> 
> ## Add column with number of each observation within ID
> A <- transform(A, ID2=ave(ID, ID, FUN=seq_along))
> B <- transform(B, ID2=ave(ID, ID, FUN=seq_along))
> 
> # Add a new column in A that numbers the row from 1 to number of row
> A$ORDER_DATA <- 1:nrow(A) 
> 
> ## Now carry out the merge
> C<-merge(A, B, all.x=TRUE, sort=FALSE)
> 
> # Sort the merged data by ORDER_DATA column
> D<-C[with(C,order(ORDER_DATA)),]
> D
  ID ID2 ht ORDER_DATA wt
1  1   1  1          1  3
2  1   2  2          2  4
4  1   3  3          3 NA
3  2   1  4          4  5

R - merge by variable column with duplicated entry

2 Answers2