88

Possible Duplicate:
Merge multiple data frames in a list simultaneously

example data.frames:

 df1 = data.frame(id=c('1','73','2','10','43'),v1=c(1,2,3,4,5)) <br>
 df2 = data.frame(id=c('7','23','57','2','62','96'),v2=c(1,2,3,4,5,6)) <br>
 df3 = data.frame(id=c('23','62'),v3=c(1,2)) <br>

Note: id is unique for each data.frame. I want the resulting matrix to look like

1      1 NA NA 
2      3  4 NA 
7      NA 1 NA 
10     4 NA NA 
23     NA 2  1 
43     5 NA NA 
57     NA 3 NA 
62     NA 5  2 
73     2 NA NA 
96     NA 6 NA

In this case, I only show 3 datasets, I actually have at least 22 of them so at the end I want a matrix of nx(22+1) where n is the number of ids for all 22 datasets.

Given 2 datasets, I need to get their ids in the first column and 2nd and 3rd columns are filled with the values, if there is no value exists, then input NA instead.

lmo
  • 37,904
  • 9
  • 56
  • 69
user1938809
  • 1,135
  • 1
  • 9
  • 12
  • This is not a solution but in addition to what is stated by @Matthew Plourde. You can build list of data.frames: df_list <- lapply(paste0("df",1:22), as.name). – Wojciech Sobala Dec 31 '12 at 09:02
  • 1
    Even though this thread may be duplicate of another, but both questions and answers are presented in a more readable way. – biocyberman Feb 15 '16 at 09:54

1 Answers1

203

Put them into a list and use merge with Reduce

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
#    id v1 v2 v3
# 1   1  1 NA NA
# 2  10  4 NA NA
# 3   2  3  4 NA
# 4  43  5 NA NA
# 5  73  2 NA NA
# 6  23 NA  2  1
# 7  57 NA  3 NA
# 8  62 NA  5  2
# 9   7 NA  1 NA
# 10 96 NA  6 NA

You can also use this more concise version:

Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))
Matthew Plourde
  • 43,932
  • 7
  • 96
  • 113
  • 13
    +1 for `Reduce`. For this simple example, this is equivalent to `merge(merge(df1, df2, by='id', all=T), df3, by='id', all=T)`. Clearly a loop could be used, iterating through the data frames -- but that's exactly what `Reduce` does. – Matthew Lundberg Dec 31 '12 at 02:40
  • 4
    I'm thankful `Reduce` is in the language, but I really wish it were more like the `*apply` functions, letting you give it additional arguments for the functional supplied. I hate that I have to embed a function definition just to use `merge` with `all=TRUE`. – Matthew Plourde Dec 31 '12 at 02:48
  • 2
    `merge_recurse` and `merge_all` from the (older) reshape package are a decent guide for how to build something that does this for you in a more convenient form. – joran Dec 31 '12 at 02:52
  • 3
    Can I ask how to modify the function if we would have to account for different `ids` across those data frames? – Konrad Jul 06 '15 at 15:45
  • I think it would be easiest to standardize the id column names. The hack to handle this with `Reduce` would get kind of obscure. – Matthew Plourde Jul 06 '15 at 15:48
  • 4
    I tried this method and it is really slow with 500 lists each with 125 rows. Are there any other fast methods – Polla A. Fattah Sep 07 '15 at 13:19
  • @PollaA.Fattah take a look at the join functions in the dplyr package – Matthew Plourde Sep 07 '15 at 13:21
  • @MatthewPlourde Thank you for your reply it turns out my problem was simpler than what I thought of and I used rbind eventually. – Polla A. Fattah Sep 07 '15 at 14:02
  • It was fast (immediate) with 8 data frames with ~3000 rows each – Scott Worland Feb 16 '16 at 23:29
  • 1
    I am using merge and reduce, but geeting the following error: Error: cannot allocate vector of size 2.5 Gb, please help – gaurav kumar Jan 02 '17 at 04:31
  • @gauravkumar Welcome to the world of big data. You'll want to check out the CRAN task view on high-performance computing (https://cran.r-project.org/web/views/HighPerformanceComputing.html), especially the section called "Large memory and out-of-memory data" – rsoren Nov 03 '17 at 23:08
  • How to handle if one of tables is null? It affects the results but should be automatically excluded – Peter.k Jan 26 '18 at 23:35
  • @Peter.k just filter first – Matthew Plourde Jan 28 '18 at 23:04
  • 1
    I faced a warning when `by = 0` saying the "Row.name" is duplicated. It turned out that the merge function does something stupid when trying to merge based on row.names. The solution I found is to move the row.names into a column and use that to merge. Hopefully this is useful to those who are desperately looking for an answer about the warning and misbehave of `merge`. – Mehrad Mahmoudian Jun 12 '20 at 22:12
  • @MehradMahmoudian how did you do that? – Luis Nov 29 '21 at 19:47
  • 1
    @Luis basically just create a new column for each dataframe/matrix and fill it with the rownames (e.g `df$rownames <- row.names(df)`) and then while merging use `by = "rownames"`. – Mehrad Mahmoudian Dec 01 '21 at 10:20
  • I wonder how to use the `suffixes` argument from `merge.data.table` when multiple dataframes have columns with the same name, not counting the matching column. – KVemuri Feb 09 '22 at 17:03