338

I have a list of many data.frames that I want to merge. The issue here is that each data.frame differs in terms of the number of rows and columns, but they all share the key variables (which I've called "var1" and "var2" in the code below). If the data.frames were identical in terms of columns, I could merely rbind, for which plyr's rbind.fill would do the job, but that's not the case with these data.

Because the merge command only works on 2 data.frames, I turned to the Internet for ideas. I got this one from here, which worked perfectly in R 2.7.2, which is what I had at the time:

merge.rec <- function(.list, ...){
    if(length(.list)==1) return(.list[[1]])
    Recall(c(list(merge(.list[[1]], .list[[2]], ...)), .list[-(1:2)]), ...)
}

And I would call the function like so:

df <- merge.rec(my.list, by.x = c("var1", "var2"), 
                by.y = c("var1", "var2"), all = T, suffixes=c("", ""))

But in any R version after 2.7.2, including 2.11 and 2.12, this code fails with the following error:

Error in match.names(clabs, names(xi)) : 
  names do not match previous names

(Incidently, I see other references to this error elsewhere with no resolution).

Is there any way to solve this?

Community
  • 1
  • 1
bshor
  • 4,859
  • 8
  • 24
  • 33

9 Answers9

313

Another question asked specifically how to perform multiple left joins using dplyr in R . The question was marked as a duplicate of this one so I answer here, using the 3 sample data frames below:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

The answer is divided in three sections representing three different ways to perform the merge. You probably want to use the purrr way if you are already using the tidyverse packages. For comparison purposes below, you'll find a base R version using the same sample dataset.


1) Join them with reduce from the purrr package:

The purrr package provides a reduce function which has a concise syntax:

library(tidyverse)
list(x, y, z) %>% reduce(left_join, by = "i")
#  A tibble: 3 x 4
#  i       j     k     l
#  <chr> <int> <int> <int>
# 1 a      1    NA     9
# 2 b      2     4    NA
# 3 c      3     5     7

You can also perform other joins, such as a full_join or inner_join:

list(x, y, z) %>% reduce(full_join, by = "i")
# A tibble: 4 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 a     1     NA     9
# 2 b     2     4      NA
# 3 c     3     5      7
# 4 d     NA    6      8

list(x, y, z) %>% reduce(inner_join, by = "i")
# A tibble: 1 x 4
# i       j     k     l
# <chr> <int> <int> <int>
# 1 c     3     5     7

2) dplyr::left_join() with base R Reduce():

list(x,y,z) %>%
    Reduce(function(dtf1,dtf2) left_join(dtf1,dtf2,by="i"), .)

#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7

3) Base R merge() with base R Reduce():

And for comparison purposes, here is a base R version of the left join based on Charles's answer.

 Reduce(function(dtf1, dtf2) merge(dtf1, dtf2, by = "i", all.x = TRUE),
        list(x,y,z))
#   i j  k  l
# 1 a 1 NA  9
# 2 b 2  4 NA
# 3 c 3  5  7
Paul Rougieux
  • 10,289
  • 4
  • 68
  • 110
  • 1
    The full_join variant works perfectly, and looks a lot less scary than the accepted answer. Not much of a speed difference, though. – bshor Dec 21 '16 at 04:44
  • 1
    @Axeman is right, but you might be able to avoid (visibly) returning a list of data frames at all by using `map_dfr()` or `map_dfc()` – DaveRGP Jun 30 '17 at 08:42
  • I though I could join a number of DF based on a pattern using ´ls(pattern = "DF_name_contains_this" )´, but no. Used ´noquote( paste(())´, but I'm still producing a character vector instead of a list of DF. I ended up typing the names, which is obnoxious. – aae Mar 07 '19 at 16:35
  • Another question provides a [python implementation](https://stackoverflow.com/a/53645883/2641825): list of pandas data frames `dfs = [df1, df2, df3]` then `reduce(pandas.merge, dfs)`. – Paul Rougieux Mar 18 '19 at 13:45
  • How can you add a suffix to avoid automatically appending of ".y" or ".x"? – jgarces Nov 19 '20 at 16:11
  • @jgarces see the suffix section in `help(left_join)`. – Paul Rougieux Nov 19 '21 at 09:49
240

Reduce makes this fairly easy:

merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)

Here's a fully example using some mock data:

set.seed(1)
list.of.data.frames = list(data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)))
merged.data.frame = Reduce(function(...) merge(..., all=T), list.of.data.frames)
tail(merged.data.frame)
#    x  a  b         y
#12 12 NA 18        NA
#13 13 NA 19        NA
#14 14 NA 20 0.4976992
#15 15 NA NA 0.7176185
#16 16 NA NA 0.3841037
#17 19 NA NA 0.3800352

And here's an example using these data to replicate my.list:

merged.data.frame = Reduce(function(...) merge(..., by=match.by, all=T), my.list)
merged.data.frame[, 1:12]

#  matchname party st district chamber senate1993 name.x v2.x v3.x v4.x senate1994 name.y
#1   ALGIERE   200 RI      026       S         NA   <NA>   NA   NA   NA         NA   <NA>
#2     ALVES   100 RI      019       S         NA   <NA>   NA   NA   NA         NA   <NA>
#3    BADEAU   100 RI      032       S         NA   <NA>   NA   NA   NA         NA   <NA>

Note: It looks like this is arguably a bug in merge. The problem is there is no check that adding the suffixes (to handle overlapping non-matching names) actually makes them unique. At a certain point it uses [.data.frame which does make.unique the names, causing the rbind to fail.

# first merge will end up with 'name.x' & 'name.y'
merge(my.list[[1]], my.list[[2]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y
#<0 rows> (or 0-length row.names)
# as there is no clash, we retain 'name.x' & 'name.y' and get 'name' again
merge(merge(my.list[[1]], my.list[[2]], by=match.by, all=T), my.list[[3]], by=match.by, all=T)
# [1] matchname    party        st           district     chamber      senate1993   name.x      
# [8] votes.year.x senate1994   name.y       votes.year.y senate1995   name         votes.year  
#<0 rows> (or 0-length row.names)
# the next merge will fail as 'name' will get renamed to a pre-existing field.

Easiest way to fix is to not leave the field renaming for duplicates fields (of which there are many here) up to merge. Eg:

my.list2 = Map(function(x, i) setNames(x, ifelse(names(x) %in% match.by,
      names(x), sprintf('%s.%d', names(x), i))), my.list, seq_along(my.list))

The merge/Reduce will then work fine.

Thomas
  • 43,637
  • 12
  • 109
  • 140
Charles
  • 4,389
  • 2
  • 16
  • 13
  • Thanks! I saw this solution also on the link from Ramnath. Looks easy enough. But I get the following error: "Error in match.names(clabs, names(xi)) : names do not match previous names". The variables I'm matching on are all present in all the dataframes in the list, so I'm not catching what this error is telling me. – bshor Nov 11 '11 at 21:49
  • 1
    I tested this solution on R2.7.2 and I get the same match.names error. So there's some more fundamental problem with this solution and my data. I used the code: Reduce(function(x, y) merge(x, y, all=T,by.x=match.by, by.y=match.by), my.list, accumulate=F) – bshor Nov 14 '11 at 19:28
  • 1
    Strange, I added the code that I tested it with which runs fine. I guess there is some field-renaming occurring based on the merge args you're using? The merged result must still have the relevant keys in order to be merged with the subsequent data frame. – Charles Nov 14 '11 at 20:12
  • I suspect something happening with empty data frames. I tried out some examples like this: `empty <- data.frame(x=numeric(0),a=numeric(0); L3 <- c(empty,empty,list.of.data.frames,empty,empty,empty)` and got some weird stuff happening that I haven't figured out yet. – Ben Bolker Nov 14 '11 at 22:10
  • @Charles You're onto something. Your code runs fine above for me. And when I adapt it to mine, it runs fine too -- except that it does a merge ignoring the key variables I want. When I try to add key variables rather than leave them out, I get a new error "Error in is.null(x) : 'x' is missing". The code line is "test.reduce <- Reduce(function(...) merge(by=match.by, all=T), my.list)" where match.by are the vector of key variable names I want merged by. – bshor Nov 15 '11 at 19:56
  • @BenBolker No it can't be empty data frames; your code isn't right. It should be `L3 <- list(empty,empty, data.frame(x=1:10, a=1:10), data.frame(x=5:14, b=11:20), data.frame(x=sample(20, 10), y=runif(10)),empty,empty,empty)` and then `m3 = Reduce(function(...) merge(..., all=T), L3)` works just fine. – bshor Nov 15 '11 at 20:08
  • @Charles Sorry that last code line is wrong. When properly rewritten as `test.reduce <- Reduce(function(...) merge(..., by=match.by, all=T), my.list)` I get the same old "match.names" error. – bshor Nov 15 '11 at 20:11
  • Still not able to replicate the problem - see updated answer. Can you provide better sample data? Maybe just save(my.list, file='my.list.RData') and upload? – Charles Nov 15 '11 at 20:28
  • @Charles You've gone above and beyond -- thanks. I uploaded replication code and real data that is accessed via url to show you the problem. Thanks for showing me pastebin. – bshor Nov 15 '11 at 22:48
  • The reason it worked without match.by is because it wasn't doing field renaming on name, but rather including that in key. – Charles Nov 16 '11 at 13:26
  • @Charles Wow; this did it! Both "Recall" and "Reduce" solutions work fine now in 2.12. Thank you. I've never really run into a bug in core R code before... I do wonder why "Recall" worked in 2.7.2 but not now. – bshor Nov 16 '11 at 18:51
59

You can do it using merge_all in the reshape package. You can pass parameters to merge using the ... argument

reshape::merge_all(list_of_dataframes, ...)

Here is an excellent resource on different methods to merge data frames.

shadow
  • 21,823
  • 4
  • 63
  • 77
Ramnath
  • 54,439
  • 16
  • 125
  • 152
  • looks like I just replicated merge_recurse =) good to know this function already exists. – Suraj Nov 11 '11 at 15:29
  • 17
    yes. whenever i have an idea, i always check if @hadley has already done it, and most of the times he has :-) – Ramnath Nov 11 '11 at 15:33
  • 1
    I'm a little confused; should I do merge_all or merge_recurse? In any case, when I try to add in my additional arguments to either, I get the error "formal argument "all" matched by multiple actual arguments". – bshor Nov 11 '11 at 21:42
  • @bshor. it would be useful to post a few lines of your original data frames, so that your error is reproducible. you can easily do it using `dput`. – Ramnath Nov 12 '11 at 00:03
  • 3
    I think I dropped this from reshape2. Reduce + merge is just as simple. – hadley Nov 12 '11 at 04:45
  • @Ramnath Yikes. My list has 19 data frames, each about 48-50 rows and 600 columns! dput puts tons of data on screen. What's the best way to summarize? – bshor Nov 14 '11 at 18:39
  • @Ramnath I updated the original post with my attempt to use a real example from my code, but shortening up the data frames for exposition. – bshor Nov 14 '11 at 18:53
  • 2
    @Ramnath, link is dead, is there a mirror? – Eduardo Oct 22 '14 at 07:29
8

We can use {powerjoin}.

Borrowing sample data from accepted answer:

x <- data.frame(i = c("a","b","c"), j = 1:3, stringsAsFactors=FALSE)
y <- data.frame(i = c("b","c","d"), k = 4:6, stringsAsFactors=FALSE)
z <- data.frame(i = c("c","d","a"), l = 7:9, stringsAsFactors=FALSE)

library(powerjoin)
power_full_join(list(x,y,z), by = "i")
#>   i  j  k  l
#> 1 a  1 NA  9
#> 2 b  2  4 NA
#> 3 c  3  5  7
#> 4 d NA  6  8

power_left_join(list(x,y,z), by = "i")
#>   i j  k  l
#> 1 a 1 NA  9
#> 2 b 2  4 NA
#> 3 c 3  5  7

You might also start with a dataframe and join a list of data frames, for the same result


power_full_join(x, list(y,z), by = "i")
#>   i  j  k  l
#> 1 a  1 NA  9
#> 2 b  2  4 NA
#> 3 c  3  5  7
#> 4 d NA  6  8
moodymudskipper
  • 46,417
  • 11
  • 121
  • 167
6

You can use recursion to do this. I haven't verified the following, but it should give you the right idea:

MergeListOfDf = function( data , ... )
{
    if ( length( data ) == 2 ) 
    {
        return( merge( data[[ 1 ]] , data[[ 2 ]] , ... ) )
    }    
    return( merge( MergeListOfDf( data[ -1 ] , ... ) , data[[ 1 ]] , ... ) )
}
Suraj
  • 35,905
  • 47
  • 139
  • 250
5

I will reuse the data example from @PaulRougieux

x <- data_frame(i = c("a","b","c"), j = 1:3)
y <- data_frame(i = c("b","c","d"), k = 4:6)
z <- data_frame(i = c("c","d","a"), l = 7:9)

Here's a short and sweet solution using purrr and tidyr

library(tidyverse)

 list(x, y, z) %>% 
  map_df(gather, key=key, value=value, -i) %>% 
  spread(key, value)
dmi3kno
  • 2,943
  • 17
  • 31
1

I had a list of dataframes with no common id column.
I had missing data on many dfs. There were Null values. The dataframes were produced using table function. The Reduce, Merging, rbind, rbind.fill, and their like could not help me to my aim. My aim was to produce an understandable merged dataframe, irrelevant of the missing data and common id column.

Therefore, I made the following function. Maybe this function can help someone.

##########################################################
####             Dependencies                        #####
##########################################################

# Depends on Base R only

##########################################################
####             Example DF                          #####
##########################################################

# Example df
ex_df           <- cbind(c( seq(1, 10, 1), rep("NA", 0), seq(1,10, 1) ), 
                         c( seq(1, 7, 1),  rep("NA", 3), seq(1, 12, 1) ), 
                         c( seq(1, 3, 1),  rep("NA", 7), seq(1, 5, 1), rep("NA", 5) ))

# Making colnames and rownames
colnames(ex_df) <- 1:dim(ex_df)[2]
rownames(ex_df) <- 1:dim(ex_df)[1]

# Making an unequal list of dfs, 
# without a common id column
list_of_df      <- apply(ex_df=="NA", 2, ( table) )

it is following the function

##########################################################
####             The function                        #####
##########################################################


# The function to rbind it
rbind_null_df_lists <- function ( list_of_dfs ) {
  length_df     <- do.call(rbind, (lapply( list_of_dfs, function(x) length(x))))
  max_no        <- max(length_df[,1])
  max_df        <- length_df[max(length_df),]
  name_df       <- names(length_df[length_df== max_no,][1])
  names_list    <- names(list_of_dfs[ name_df][[1]])

  df_dfs <- list()
  for (i in 1:max_no ) {

    df_dfs[[i]]            <- do.call(rbind, lapply(1:length(list_of_dfs), function(x) list_of_dfs[[x]][i]))

  }

  df_cbind               <- do.call( cbind, df_dfs )
  rownames( df_cbind )   <- rownames (length_df)
  colnames( df_cbind )   <- names_list

  df_cbind

}

Running the example

##########################################################
####             Running the example                 #####
##########################################################

rbind_null_df_lists ( list_of_df )
Estatistics
  • 874
  • 9
  • 24
1

Here is a generic wrapper which can be used to convert a binary function to multi-parameters function. The benefit of this solution is that it is very generic and can be applied to any binary functions. You just need to do it once and then you can apply it any where.

To demo the idea, I use simple recursion to implement. It can be of course implemented with more elegant way that benefits from R's good support for functional paradigm.

fold_left <- function(f) {
return(function(...) {
    args <- list(...)
    return(function(...){
    iter <- function(result,rest) {
        if (length(rest) == 0) {
            return(result)
        } else {
            return(iter(f(result, rest[[1]], ...), rest[-1]))
        }
    }
    return(iter(args[[1]], args[-1]))
    })
})}

Then you can simply wrap any binary functions with it and call with positional parameters (usually data.frames) in the first parentheses and named parameters in the second parentheses (such as by = or suffix =). If no named parameters, leave second parentheses empty.

merge_all <- fold_left(merge)
merge_all(df1, df2, df3, df4, df5)(by.x = c("var1", "var2"), by.y = c("var1", "var2"))

left_join_all <- fold_left(left_join)
left_join_all(df1, df2, df3, df4, df5)(c("var1", "var2"))
left_join_all(df1, df2, df3, df4, df5)()
englealuze
  • 1,445
  • 12
  • 19
0

When you have a list of dfs, and a column contains the "ID", but in some lists, some IDs are missing, then you may use this version of Reduce / Merge in order to join multiple Dfs of missing Row Ids or labels:

Reduce(function(x, y) merge(x=x, y=y, by="V1", all.x=T, all.y=T), list_of_dfs)
Estatistics
  • 874
  • 9
  • 24