10

I have the need to merge data sets by row but they have differing columns. How can I easily get R to merge the rows, add missing columns and fill in the missing columns with NAs? Currently I would do it like this (very time consuming for multiple merges):

Creating fake data...

x1<-LETTERS[1:3]
x2<-letters[1:3]
x3<-rnorm(3)
x4<-rnorm(3)
x5<-rnorm(3)

Example of multiple data.frames with some similar columns, some different...

data.frame(x1,x2,x3,x4,x5)
data.frame(x1,x3,x4,x5)
data.frame(x2,x3,x4,x5)
data.frame(x1,x2,x3,x4,x5)

How I merge it now...

DF<-data.frame(rbind(data.frame(x1,x2,x3,x4,x5),
data.frame(x1,x2,x3,x4,x5),
data.frame("x2"=rep(NA,3),data.frame(x1,x3,x4,x5)),
data.frame("x1"=rep(NA,3),data.frame(x2,x3,x4,x5))))

DF

EDIT: I tried the suggested code as follows:

l <- list(data.frame(x1,x2,x3,x4,x5),
          data.frame(x1,x3,x4,x5),
          data.frame(x2,x3,x4,x5),
          data.frame(x1,x2,x3,x4,x5))

merger <- function(l) lapply(2:length(l), function(x) merge(l[[x-1]], l[[x]], all=TRUE)) 
while (length(l) != 1) l<-merger(l) 

l

Which yields:

[[1]]
  x1       x3      x4        x5 x2
1  A  0.25492 0.30160  0.259287  a
2  B -0.25937 0.45936 -0.075415  b
3  C -0.53493 1.18316  0.627335  c

Not:

> DF
     x1   x2       x3      x4        x5
1     A    a  0.25492 0.30160  0.259287
2     B    b -0.25937 0.45936 -0.075415
3     C    c -0.53493 1.18316  0.627335
4     A    a  0.25492 0.30160  0.259287
5     B    b -0.25937 0.45936 -0.075415
6     C    c -0.53493 1.18316  0.627335
7     A <NA>  0.25492 0.30160  0.259287
8     B <NA> -0.25937 0.45936 -0.075415
9     C <NA> -0.53493 1.18316  0.627335
10 <NA>    a  0.25492 0.30160  0.259287
11 <NA>    b -0.25937 0.45936 -0.075415
12 <NA>    c -0.53493 1.18316  0.627335

EDIT 2: Sorry to extend my original post but my low rep will not allow me to answer my own question.

Combining Jaron and daroczig's responses results in exactly what I want. I don't want to assign each data frame to an object, so combining them as a list and then using rbind fill works very nicely (see code below)

Thank you to both of you!

x1<-LETTERS[1:3] 
x2<-letters[1:3] 
x3<-rnorm(3) 
x4<-rnorm(3) 
x5<-rnorm(3)

DFlist<-list(data.frame(x1,x2,x3,x4,x5), 
             data.frame(x1,x3,x4,x5),
             data.frame(x2,x3,x4,x5), 
             data.frame(x1,x2,x3,x4,x5))

rbind.fill(DFlist) 
Tyler Rinker
  • 108,132
  • 65
  • 322
  • 519

3 Answers3

15

I had to read your question quite a few times before I understood what you were looking for, but maybe you want rbind.fill from plyr:

d1 <- data.frame(x1,x2,x3,x4,x5)
d2 <- data.frame(x1,x3,x4,x5)
d3 <- data.frame(x2,x3,x4,x5)
d4 <- data.frame(x1,x2,x3,x4,x5)

> rbind.fill(d1,d4,d2,d3)
     x1   x2        x3         x4         x5
1     A    a 1.1216923  0.9236393  0.2749292
2     B    b 1.1913278  1.1145664 -0.5070576
3     C    c 0.2837657 -0.6631544 -1.0675885
4     A    a 1.1216923  0.9236393  0.2749292
5     B    b 1.1913278  1.1145664 -0.5070576
6     C    c 0.2837657 -0.6631544 -1.0675885
7     A <NA> 1.1216923  0.9236393  0.2749292
8     B <NA> 1.1913278  1.1145664 -0.5070576
9     C <NA> 0.2837657 -0.6631544 -1.0675885
10 <NA>    a 1.1216923  0.9236393  0.2749292
11 <NA>    b 1.1913278  1.1145664 -0.5070576
12 <NA>    c 0.2837657 -0.6631544 -1.0675885
joran
  • 169,992
  • 32
  • 429
  • 468
  • Err, the checkmark should go here I think (+1), as I've read about `rbind.fill`. – daroczig Oct 26 '11 at 06:17
  • @darozig I thought you could check them both but alas you can not. Both answers gave me what I wanted, but of the two this one was more correct. – Tyler Rinker Oct 26 '11 at 10:54
2

Using data.table::rbindlist with fill = TRUE option:

data.table::rbindlist(
  list(data.frame(x1,x2,x3,x4,x5), 
       data.frame(x1,x3,x4,x5),
       data.frame(x2,x3,x4,x5), 
       data.frame(x1,x2,x3,x4,x5)),
  fill = TRUE)
Community
  • 1
  • 1
zx8754
  • 52,746
  • 12
  • 114
  • 209
1

Let us say you have your data frames in a nice list:

l <- list(
    data.frame(x2=rnorm(3),x1=rnorm(3)),
    data.frame(x1=rnorm(3),x2=rnorm(3),x3=rnorm(3),x4=rnorm(3),x5=rnorm(3)),
    data.frame(x5=rnorm(3),x2=rnorm(3),x3=rnorm(3),x4=rnorm(3),x1=rnorm(3)),
    data.frame(x5=rnorm(3),x2=rnorm(3),x3=rnorm(3),x4=rnorm(3)),
    data.frame(x2=rnorm(3),x1=rnorm(3),x3=rnorm(3),x4=rnorm(3))
)

Grab the first one and (as @joran suggested) merge all the rest to that with eg. a lucid loop:

r <- l[[1]]
for (i in 2:length(l)) {
    r <- merge(r, l[[i]], all=TRUE)
}

Check out r:

> r
         x2        x3       x4       x1        x5
1  -1.72436 -0.774652  3.10001  0.23249 -1.278216
2  -1.25640        NA       NA  0.32997        NA
3  -1.00652 -0.946254  1.17313       NA  2.014517
4  -0.53770 -0.466626 -0.63369 -1.48375 -1.135515
5  -0.49787        NA       NA -0.34020        NA
6  -0.49704 -0.054175  0.85477       NA  0.831706
7   0.13027  0.421750 -0.18126 -0.65452  0.476576
8   0.18519 -1.006994  0.15141  0.66808        NA
9   0.33954 -0.224478  1.38596       NA  0.145807
10  0.57782  1.126430 -0.89582  0.80199        NA
11  0.59149 -0.447669  0.74855 -1.65790  0.059767
12  0.61374  0.751528 -1.93715  0.40125 -0.148243
13  0.89399  0.758481 -0.94801  0.05084        NA
14  0.94200        NA       NA  0.24945        NA
15  0.99509  0.586097 -0.91455 -0.49909  0.823696

I didn't like that loop, so wrote some recursive stuff:

> merger <- function(l) lapply(2:length(l), function(x) merge(l[[x-1]], l[[x]], all=TRUE))
> while (length(l) != 1) l<-merger(l)
> l
[[1]]
         x2       x1        x3       x4        x5
1  -1.72436  0.23249 -0.774652  3.10001 -1.278216
2  -1.25640  0.32997        NA       NA        NA
3  -1.00652       NA -0.946254  1.17313  2.014517
4  -0.53770 -1.48375 -0.466626 -0.63369 -1.135515
5  -0.49787 -0.34020        NA       NA        NA
6  -0.49704       NA -0.054175  0.85477  0.831706
7   0.13027 -0.65452  0.421750 -0.18126  0.476576
8   0.18519  0.66808 -1.006994  0.15141        NA
9   0.33954       NA -0.224478  1.38596  0.145807
10  0.57782  0.80199  1.126430 -0.89582        NA
11  0.59149 -1.65790 -0.447669  0.74855  0.059767
12  0.61374  0.40125  0.751528 -1.93715 -0.148243
13  0.89399  0.05084  0.758481 -0.94801        NA
14  0.94200  0.24945        NA       NA        NA
15  0.99509 -0.49909  0.586097 -0.91455  0.823696
daroczig
  • 28,004
  • 7
  • 90
  • 124
  • I think the OP wants to `rbind` ("merge" vertically) rather than `merge` ("merge" horizontally) ... – Ben Bolker Oct 25 '11 at 23:50
  • @Ben Bolker: I think I did not do any merging "horizontally". But I might misunderstand something :( – daroczig Oct 25 '11 at 23:57
  • Perhaps I should have said "by row rather than by column". I put "merge" in quotation marks because in R and in general database jargon (I think) it almost always means by column – Ben Bolker Oct 26 '11 at 00:04
  • PS I think something like `Reduce(l,merge)` can do what you've done ... – Ben Bolker Oct 26 '11 at 00:05
  • @daroczig I combined your code with jarons to do exactly what I wanted. Thank you. – Tyler Rinker Oct 26 '11 at 02:26
  • @BenBolker: yes, `Reduce()` sounds fine for this task :) Thanks for pointing my attention to it, I have never used it before. BTW: I still think that a simple `rbind` or `rbind.fill` is not the right tool here, as if we are changing the order of columns (with different classes) it could result in a messed up data frame. But of course all this depends on what the OP wants :) – daroczig Oct 26 '11 at 05:55