
I have ~40K data frames in a list. Each data frame has the same 7 variables: 3 factors and 4 numeric columns. For reference, here is the structure of the first data frame:

 $ a:'data.frame':  4 obs. of  7 variables:
  ..$ x1: Factor w/ 1 level "a": 1 1 1 1
  ..$ x2: Factor w/ 4 levels "12345678901234",..: 1 2 3 4
  ..$ x3: Factor w/ 4 levels "SAMPLE",..: 1 2 3 4
  ..$ x4: int [1:4] 1 2 3 4
  ..$ x5: num [1:4] 10 20 30 40
  ..$ x6: int [1:4] 50 60 70 80
  ..$ x7: num [1:4] 0.5 0.7 0.35 1

I'm trying to merge these into a single ginormous data frame, using:

Reduce(function(...) merge(..., all = TRUE), df_list)

As recommended here: Simultaneously merge multiple data.frames in a list.

If I take the first 1000 items, i.e.

Reduce(function(...) merge(..., all = TRUE), df_list[1:1000])

This produces the desired result (merges the individual data frames into a single one) and completes in 37 seconds.

However, running Reduce() on the entire 40K-element list takes an inordinate amount of time. I've let it run for more than 5 hours and it doesn't appear to complete.

Are there any tricks that I can use to improve the performance of Reduce(), or is there a better alternative?

argoneus
  • Can I ask how many rows the 40k data frames have? – joran Aug 06 '13 at 20:32
  • Are you sure you want to `merge` them, not `rbind` them, as in `do.call(rbind, df_list)`? – flodel Aug 06 '13 at 20:38
  • It varies by each df, but the total number of observations across all data frames is just over one million. – argoneus Aug 06 '13 at 20:39
  • the R wiki has some advice (possibly out-of-date and wrong): http://rwiki.sciviews.org/doku.php?id=tips:data-frames:merge&s=merge#merge_multiple_data.frames – baptiste Aug 06 '13 at 20:42
  • 1 million rows, 7 columns isn't *that* much. @flodel 's `do.call(rbind, df_list)` should do the trick... – Simon O'Hanlon Aug 06 '13 at 20:44
  • `rbind` may do the trick, I'll try that. Thanks! – argoneus Aug 06 '13 at 20:44
  • You could try `rbindlist` from `data.table` instead of `rbind` (although there are some things you'll have to take care of manually regarding data types). – Arun Aug 06 '13 at 20:51
  • If you are sure `rbind` is enough, check [here](http://stackoverflow.com/questions/18003717/is-there-any-efficient-way-than-rbind-filllist/18063131#18063131) for speed ups. `data.table::rbindlist` is probably what you need. – Xachriel Aug 06 '13 at 20:53
  • @Xachriel, @Arun thanks, I'll check out the parallelized `rbindlist`. – argoneus Aug 06 '13 at 21:02
  • `rbind` did the trick, and finished in 606s! @flodel I'll accept your answer if you submit it. Thanks for all of the comments and suggestions everyone – argoneus Aug 06 '13 at 21:09
  • You should check that rbinding really _is_ what you want, as opposed to merging. Otherwise, you'll have taken 606 seconds to obtain nonsense. – Hong Ooi Aug 06 '13 at 21:14
  • @HongOoi yes, `merge` and `rbind` produce the same output in this case – argoneus Aug 07 '13 at 01:25
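
For reference, a minimal sketch of the row-binding approach suggested in the comments above. It assumes, as in the question, that all data frames in `df_list` share the same seven columns; `big` is a hypothetical result name:

# Base R: stack all data frames row-wise
# (requires identical column names across data frames)
big <- do.call(rbind, df_list)

# Usually much faster: data.table::rbindlist
# use.names/fill match columns by name and tolerate missing ones;
# as noted in the comments, check data types (e.g. factor levels) afterwards
library(data.table)
big <- as.data.frame(rbindlist(df_list, use.names = TRUE, fill = TRUE))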

1 Answer


If you really need merge and not just rbind, you could first merge the data frames two by two (1 with 2, 3 with 4, 5 with 6, etc.), then merge the resulting data frames two by two, and so on, until only one data frame remains. Unlike Reduce, which re-merges the ever-growing accumulated result at every step, this keeps the intermediate results small: each data frame takes part in only about log2(n) merges.

# One step: merge the list elements pairwise (1 with 2, 3 with 4, ...);
# an unpaired last element is carried over unchanged
merge_some <- function(l, ...) {
  n <- length(l)
  k <- floor(n / 2)
  result <- vector("list", k)
  for (i in seq_len(k)) {
    result[[i]] <- merge(l[[2 * i - 1]], l[[2 * i]], ...)
  }
  if (2 * k < n) {
    result[[k + 1]] <- l[[n]]
  }
  result
}

# Sample data: 1000 small data frames sharing an 'id' key,
# each with one uniquely named value column
d <- lapply(1:1000, function(i) {
  r <- data.frame(id = sample(1:100,3), v = rnorm(3))
  names(r)[[2]] <- paste0("v",i)
  r
} )

# Iterate until there is only one data.frame left
while( length(d) > 1 ) {
  d <- merge_some(d, by="id", all=TRUE)
} 

# Result
head(d[[1]])
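
Applied to the original problem, the same loop would work directly. A sketch, assuming the question's `df_list` and guessing that the three factor columns are the merge keys (the question doesn't say which columns to join on, so adjust `by` accordingly):

# Hypothetical: pairwise-merge the 40K data frames down to one;
# the key columns x1:x3 are an assumption, not stated in the question
d <- df_list
while (length(d) > 1) {
  d <- merge_some(d, by = c("x1", "x2", "x3"), all = TRUE)
}
result <- d[[1]]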
Vincent Zoonekynd