Merging a lot of data.frames

Question

Possible Duplicate:
Merge multiple data frames in a list simultaneously

example data.frames:

 df1 = data.frame(id=c('1','73','2','10','43'),v1=c(1,2,3,4,5)) <br>
 df2 = data.frame(id=c('7','23','57','2','62','96'),v2=c(1,2,3,4,5,6)) <br>
 df3 = data.frame(id=c('23','62'),v3=c(1,2)) <br>

Note: id is unique for each data.frame. I want the resulting matrix to look like

1      1 NA NA 
2      3  4 NA 
7      NA 1 NA 
10     4 NA NA 
23     NA 2  1 
43     5 NA NA 
57     NA 3 NA 
62     NA 5  2 
73     2 NA NA 
96     NA 6 NA

In this case, I only show 3 datasets, I actually have at least 22 of them so at the end I want a matrix of nx(22+1) where n is the number of ids for all 22 datasets.

Given 2 datasets, I need to get their ids in the first column and 2nd and 3rd columns are filled with the values, if there is no value exists, then input NA instead.

This is not a solution but in addition to what is stated by @Matthew Plourde. You can build list of data.frames: df_list <- lapply(paste0("df",1:22), as.name). — Wojciech Sobala, Dec 31 '12 at 09:02
Even though this thread may be duplicate of another, but both questions and answers are presented in a more readable way. — biocyberman, Feb 15 '16 at 09:54

Matthew Plourde · Accepted Answer · 2015-04-22T17:10:57.303

203

Put them into a list and use merge with Reduce

Reduce(function(x, y) merge(x, y, all=TRUE), list(df1, df2, df3))
#    id v1 v2 v3
# 1   1  1 NA NA
# 2  10  4 NA NA
# 3   2  3  4 NA
# 4  43  5 NA NA
# 5  73  2 NA NA
# 6  23 NA  2  1
# 7  57 NA  3 NA
# 8  62 NA  5  2
# 9   7 NA  1 NA
# 10 96 NA  6 NA

You can also use this more concise version:

Reduce(function(...) merge(..., all=TRUE), list(df1, df2, df3))

edited Apr 22 '15 at 17:10

answered Dec 31 '12 at 02:40

Matthew Plourde

43,932
7
96
113

13

+1 for `Reduce`. For this simple example, this is equivalent to `merge(merge(df1, df2, by='id', all=T), df3, by='id', all=T)`. Clearly a loop could be used, iterating through the data frames -- but that's exactly what `Reduce` does. – Matthew Lundberg Dec 31 '12 at 02:40
4

I'm thankful `Reduce` is in the language, but I really wish it were more like the `*apply` functions, letting you give it additional arguments for the functional supplied. I hate that I have to embed a function definition just to use `merge` with `all=TRUE`. – Matthew Plourde Dec 31 '12 at 02:48
2

`merge_recurse` and `merge_all` from the (older) reshape package are a decent guide for how to build something that does this for you in a more convenient form. – joran Dec 31 '12 at 02:52
3

Can I ask how to modify the function if we would have to account for different `ids` across those data frames? – Konrad Jul 06 '15 at 15:45
I think it would be easiest to standardize the id column names. The hack to handle this with `Reduce` would get kind of obscure. – Matthew Plourde Jul 06 '15 at 15:48
4

I tried this method and it is really slow with 500 lists each with 125 rows. Are there any other fast methods – Polla A. Fattah Sep 07 '15 at 13:19
@PollaA.Fattah take a look at the join functions in the dplyr package – Matthew Plourde Sep 07 '15 at 13:21
@MatthewPlourde Thank you for your reply it turns out my problem was simpler than what I thought of and I used rbind eventually. – Polla A. Fattah Sep 07 '15 at 14:02
It was fast (immediate) with 8 data frames with ~3000 rows each – Scott Worland Feb 16 '16 at 23:29
1

I am using merge and reduce, but geeting the following error: Error: cannot allocate vector of size 2.5 Gb, please help – gaurav kumar Jan 02 '17 at 04:31
@gauravkumar Welcome to the world of big data. You'll want to check out the CRAN task view on high-performance computing (https://cran.r-project.org/web/views/HighPerformanceComputing.html), especially the section called "Large memory and out-of-memory data" – rsoren Nov 03 '17 at 23:08
How to handle if one of tables is null? It affects the results but should be automatically excluded – Peter.k Jan 26 '18 at 23:35
@Peter.k just filter first – Matthew Plourde Jan 28 '18 at 23:04
1

I faced a warning when `by = 0` saying the "Row.name" is duplicated. It turned out that the merge function does something stupid when trying to merge based on row.names. The solution I found is to move the row.names into a column and use that to merge. Hopefully this is useful to those who are desperately looking for an answer about the warning and misbehave of `merge`. – Mehrad Mahmoudian Jun 12 '20 at 22:12
@MehradMahmoudian how did you do that? – Luis Nov 29 '21 at 19:47
1

@Luis basically just create a new column for each dataframe/matrix and fill it with the rownames (e.g `df$rownames <- row.names(df)`) and then while merging use `by = "rownames"`. – Mehrad Mahmoudian Dec 01 '21 at 10:20
I wonder how to use the `suffixes` argument from `merge.data.table` when multiple dataframes have columns with the same name, not counting the matching column. – KVemuri Feb 09 '22 at 17:03

Merging a lot of data.frames

1 Answers1

Linked

Related