Merge data.frames with duplicates

Question

I have many data.frames, for example:

df1 = data.frame(names=c('a','b','c','c','d'),data1=c(1,2,3,4,5))
df2 = data.frame(names=c('a','e','e','c','c','d'),data2=c(1,2,3,4,5,6))
df3 = data.frame(names=c('c','e'),data3=c(1,2))

and I need to merge these data.frames without dropping the duplicated names:

> result
  names data1 data2 data3
1   'a'     1     1    NA
2   'b'     2    NA    NA
3   'c'     3     4     1
4   'c'     4     5    NA
5   'd'     5     6    NA
6   'e'    NA     2     2
7   'e'    NA     3    NA

I can't find a function like merge with an option to handle duplicate names. Thank you for your help. To clarify my problem: the data come from a biological experiment in which each sample has a different number of replicates. I need to merge all the experiments to produce this table. I can't generate a unique identifier for the replicates.

G. Grothendieck · Accepted Answer · 2012-03-26T13:35:18.323

4

From the desired output, it appears that the ith duplicate of each name in one data frame should be matched with the ith duplicate of that name in the others. First define a function, run.seq, which numbers the occurrences of each value (1 for the first occurrence, 2 for the second, and so on). Then create a list of the data frames and add a run.seq column to each component. Finally use Reduce to merge them all on both names and run.seq, and drop the run.seq column (the second column) from the result.

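# number the occurrences of each value: 1 for the first, 2 for the second, ...
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

# add a run.seq column to every data frame
L <- list(df1, df2, df3)
L2 <- lapply(L, function(x) cbind(x, run.seq = run.seq(x$names)))

# full outer merge on the shared columns (names, run.seq), then drop run.seq
out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]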

The last line gives:

> out
  names data1 data2 data3
1     a     1     1    NA
2     b     2    NA    NA
3     c     3     4     1
4     c     4     5    NA
5     d     5     6    NA
6     e    NA     2     2
7     e    NA     3    NA
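If this is needed for more than one set of data frames, the same steps could be wrapped in a small helper. This is only a sketch: the function name merge_with_dups and its key argument are made up for illustration and are not part of base R or any package.

```r
# Example data from the question
df1 <- data.frame(names = c('a','b','c','c','d'), data1 = c(1,2,3,4,5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = c('a','e','e','c','c','d'), data2 = c(1,2,3,4,5,6),
                  stringsAsFactors = FALSE)
df3 <- data.frame(names = c('c','e'), data3 = c(1,2),
                  stringsAsFactors = FALSE)

# number the occurrences of each value: 1, 2, 3, ...
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

# hypothetical helper: merge a list of data frames, pairing the ith
# occurrence of each key value across all of them
merge_with_dups <- function(dfs, key = "names") {
  tagged <- lapply(dfs, function(d) {
    d$run.seq <- run.seq(d[[key]])
    d
  })
  merged <- Reduce(function(a, b) merge(a, b, by = c(key, "run.seq"), all = TRUE),
                   tagged)
  merged$run.seq <- NULL   # drop the helper column by name
  merged
}

out <- merge_with_dups(list(df1, df2, df3))
out
```

Dropping run.seq by name rather than by position ([-2]) keeps the helper working if the column order ever changes.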

EDIT: Revised run.seq so that input need not be sorted.
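As a quick check of the unsorted case, here is run.seq applied to the unsorted example from the comments below. The variable name unsorted is just for illustration.

```r
run.seq <- function(x) as.numeric(ave(paste(x), x, FUN = seq_along))

unsorted <- c('a', 'b', 'c', 'd', 'c', 'c')
run.seq(unsorted)
# 1 1 1 1 2 3   (the three 'c' values are numbered 1, 2, 3 in order of appearance)

df1 <- data.frame(names = unsorted, data1 = c(1,2,3,4,5,6),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = c('e','c','c','c'), data2 = c(1,2,3,4),
                  stringsAsFactors = FALSE)

L2 <- lapply(list(df1, df2), function(x) cbind(x, run.seq = run.seq(x$names)))
out <- Reduce(function(...) merge(..., all = TRUE), L2)[-2]

# the 'c' rows pair up in order of appearance: data1 3, 5, 6 with data2 2, 3, 4
out[out$names == 'c', ]
```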

edited Mar 26 '12 at 13:35

answered Mar 26 '12 at 00:02

This solution works properly only for sorted data, but that is OK for me. Thank you very much, you are the greatest. For this example: `df1 = data.frame(names=c('a','b','c','d','c','c'),data1=c(1,2,3,4,5,6))` `df2 = data.frame(names=c('e','c','c','c'),data2=c(1,2,3,4))` it doesn't work without sorting names. – user1291855 Mar 26 '12 at 07:50
Have revised `run.seq` so that input need not be sorted. – G. Grothendieck Mar 26 '12 at 11:40

score 1 · Answer 2 · edited May 23 '17 at 12:10

See other questions:

Examples:

library(reshape)
out <- merge_recurse(L)

or

library(plyr)

out <- join(df1, df2, type = "full")
out <- join(out, df3, type = "full")
# (these joins can be looped)

or

library(plyr)
out <- ldply(L)

answered Mar 26 '12 at 11:37

Etienne Low-Décarie

score -1 · Answer 3 · answered Mar 25 '12 at 22:44

I think there is just not enough information in your example data frames to do this. Which 'c' in data frame 1 should be paired with which 'c' in data frame 2? We cannot tell, so R cannot either. I suspect you will have to add another variable to each of your data frames that uniquely identifies these duplicate cases.
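To make the ambiguity concrete, here is a small sketch using df1 and df2 from the question. With no extra identifier, a plain merge on names pairs every 'c' in one data frame with every 'c' in the other. The variable name plain is just for illustration.

```r
df1 <- data.frame(names = c('a','b','c','c','d'), data1 = c(1,2,3,4,5),
                  stringsAsFactors = FALSE)
df2 <- data.frame(names = c('a','e','e','c','c','d'), data2 = c(1,2,3,4,5,6),
                  stringsAsFactors = FALSE)

# full outer merge on names only
plain <- merge(df1, df2, by = "names", all = TRUE)

nrow(plain)               # 9
sum(plain$names == "c")   # 4: each 'c' in df1 is paired with each 'c' in df2
```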

Marius

It does not matter which 'c' in data frame 1 is paired with which 'c' in data frame 2 (I think taking the first unpaired one is best, and once all are paired a new row should be created). I know that it is not so easy with duplicate identifiers. – user1291855 Mar 25 '12 at 22:56
