36

I gather data from 4 df's and would like to merge them by rownames. I am looking for an efficient way to do this. This is a simplified version of the data I have.

df1           <- data.frame(N= sample(seq(9, 27, 0.5), 40, replace= T),
                            P= sample(seq(0.3, 4, 0.1), 40, replace= T),
                            C= sample(seq(400, 500, 1), 40, replace= T))
df2           <- data.frame(origin= sample(c("A", "B", "C", "D", "E"), 40,
                                           replace= T),
                            foo1= sample(c(T, F), 40, replace= T),
                            X= sample(seq(145600, 148300, 100), 40, replace= T),
                            Y= sample(seq(349800, 398600, 100), 40, replace= T))
df3           <- matrix(sample(seq(0, 1, 0.01), 40), 40, 100)
df4           <- matrix(sample(seq(0, 1, 0.01), 40), 40, 100)
rownames(df1) <- paste("P", sprintf("%02d", c(1:40)), sep= "")
rownames(df2) <- rownames(df1)
rownames(df3) <- rownames(df1)
rownames(df4) <- rownames(df1)

This is what I would normally do:

# merge df1 and df2
dat           <- merge(df1, df2, by= "row.names", all.x= F, all.y= F) #merge
rownames(dat) <- dat$Row.names #reset rownames
dat$Row.names <- NULL  #remove added rownames col

# merge dat and df3
dat           <- merge(dat, df3, by= "row.names", all.x= F, all.y= F) #merge
rownames(dat) <- dat$Row.names #reset rownames
dat$Row.names <- NULL  #remove added rownames col

# merge dat and df4
dat           <- merge(dat, df4, by= "row.names", all.x= F, all.y= F) #merge
rownames(dat) <- dat$Row.names #reset rownames
dat$Row.names <- NULL #remove added rownames col

As you can see, this requires a lot of code. My question is if the same result can be achieved with more simple means. I've tried (without success): UPDATE: this works now!

MyMerge       <- function(x, y){
  df            <- merge(x, y, by= "row.names", all.x= F, all.y= F)
  rownames(df)  <- df$Row.names
  df$Row.names  <- NULL
  return(df)
}
dat           <- Reduce(MyMerge, list(df1, df2, df3, df4))
starball
  • 20,030
  • 7
  • 43
  • 238
Hans Roelofsen
  • 741
  • 1
  • 7
  • 13
  • 1
    What exactly do you mean by `without success`? Please be more specific, include errors. Even better, create a reproducible example. – Paul Hiemstra May 21 '13 at 09:54
  • 1.) If the row names are are so important to your data structure, that you merge by those, why don't you just spend the `data.frame` a true column for that? Which saves you most of the coding. 2.) Even if you keep them you could save a lot of coding, see `merge` parameters `by.x` and `by.y` 3.) Removing a column from a data.frame can be achieved with `df$Row.Names <- NULL` 4.) The `Reduce` approach should actually work, I'm also wondering why this would fail. – Beasterfield May 21 '13 at 11:20
  • I've included some example data. I also found that the suggested approach with does work after all. The problem was that I wanted to merge a single column from a df, thereby removing the rownames information. – Hans Roelofsen May 21 '13 at 11:43
  • However, in this setup only intersecting rownames are retained `all.x= F` and `all.y= F`. Would it be possible to retain all rows of df1, but exclude rows from the other df's that are not `%in% rownames(df1)`, i.e. `all.x= T, all.y= F`. – Hans Roelofsen May 21 '13 at 11:55
  • ok, O've got that last issue covered as well. Just adjust `all.x= T, all.y= T` in the `MyMerge' function. Thanks for having a look @Paul and @Beasterfield. – Hans Roelofsen May 21 '13 at 12:18
  • I've upvoted the question, so you should now have the rep to post your answer as an **answer** (which is encouraged), rather than as an edit to your question. – Ben Bolker May 21 '13 at 12:28
  • Duplicate of [Simultaneously merge multiple data.frames in a list](https://stackoverflow.com/questions/8091303/simultaneously-merge-multiple-data-frames-in-a-list) – M-- Jan 07 '20 at 18:07

4 Answers4

51

join_all from plyr will probably do what you want. But they all must be data frames and the rownames are added as a column

require(plyr)

df3 <- data.frame(df3)
df4 <- data.frame(df4)

df1$rn <- rownames(df1)
df2$rn <- rownames(df2)
df3$rn <- rownames(df3)
df4$rn <- rownames(df4)

df <- join_all(list(df1,df2,df3,df4), by = 'rn', type = 'full')

type argument should help even if the rownames vary and do not match If you do not want the rownames:

df$rn <- NULL
Anto
  • 1,189
  • 10
  • 16
15

Editing your function, I have came up with the function which allows you to merge more data frames by a specific column key (name of the column). The resulted data frame includes all the variable of the merged data frames (if you wanna keep just the common variables (excluding NA, use: all.x= FALSE, all.y= FALSE)

MyMerge <- function(x, y){
  df <- merge(x, y, by= "name of the common column", all.x= TRUE, all.y= TRUE)
  return(df)
}
new.df <- Reduce(MyMerge, list(df1, df2, df3, df4))
David Arenburg
  • 91,361
  • 17
  • 137
  • 196
Roxana Adam
  • 151
  • 1
  • 2
  • 1
    Nice function, is there anyway this function can rename column names, and give diff column names to common column names ? – Chirag Mar 24 '16 at 02:29
  • After merging the two data frames I usually use the function "fix()" which allows you to edit the data base - therefore you can also rename the columns. – Roxana Adam Mar 27 '16 at 13:04
8

I have been looking for the same function. After trying a couple of the options here and others elsewhere. The easiest for me was:

cbind.data.frame( df1,df2,df3,df4....)
M--
  • 25,431
  • 8
  • 61
  • 93
Jorge
  • 2,181
  • 1
  • 19
  • 30
7

Three lines of code will give you the exact same result:

dat2 <- cbind(df1, df2, df3, df4)
colnames(dat2)[-(1:7)] <- paste(paste('V', rep(1:100, 2),sep = ''),
                            rep(c('x', 'y'), each = 100), sep = c('.'))
all.equal(dat,dat2)    

Ah I see, now I understand why you are getting into so much pain. Using the old for loop surely does the trick. Maybe there are even more clever solutions

rn <- rownames(df1)
l <- list(df1, df2, df3, df4)
dat <- l[[1]]
for(i in 2:length(l)) {
  dat <- merge(dat, l[[i]],  by= "row.names", all.x= F, all.y= F) [,-1]
  rownames(dat) <- rn
}
Edwin
  • 3,184
  • 1
  • 23
  • 25
  • Hi, thanks for your reply. I see how it works out. However, and I admit I havn't made that clear in my example data, I want this to work also when the `rownames` are dissimilar. So in the example the rownames are equal, but the processing should still work when the rows are shuffled, or if e.g. `df2` has more or less rows. That's why I opted for `merge`. – Hans Roelofsen May 21 '13 at 14:57