
I have ~40K data frames in a list. Each data frame has the same 7 variables: 3 factors and 4 numeric columns. For reference, here is the structure of the first data frame:

 $ a:'data.frame':  4 obs. of  7 variables:
  ..$ x1: Factor w/ 1 level "a": 1 1 1 1
  ..$ x2: Factor w/ 4 levels "12345678901234",..: 1 2 3 4
  ..$ x3: Factor w/ 4 levels "SAMPLE",..: 1 2 3 4
  ..$ x4: int [1:4] 1 2 3 4
  ..$ x5: num [1:4] 10 20 30 40
  ..$ x6: int [1:4] 50 60 70 80
  ..$ x7: num [1:4] 0.5 0.7 0.35 1

I'm trying to merge these into a single ginormous data frame, using:

Reduce(function(...) merge(..., all = TRUE), df_list)

As recommended here: Simultaneously merge multiple data.frames in a list.

If I take the first 1000 items, i.e.

Reduce(function(...) merge(..., all = TRUE), df_list[1:1000])

This produces the desired result (merges the individual data frames into a single one) and completes in 37 seconds.

However, running Reduce() on the entire 40K-element list takes an inordinate amount of time. I've let it run for more than 5 hours and it doesn't appear to complete.

Are there any tricks that I can use to improve the performance of Reduce(), or is there a better alternative?

argoneus
  • Can I ask how many rows the 40k data frames have? – joran Aug 06 '13 at 20:32
  • Are you sure you want to `merge` them, not `rbind` them, as in `do.call(rbind, df_list)`? – flodel Aug 06 '13 at 20:38
  • It varies by each df, but the total number of observations across all data frames is just over one million. – argoneus Aug 06 '13 at 20:39
  • the R wiki has some advice (possibly out-of-date and wrong): http://rwiki.sciviews.org/doku.php?id=tips:data-frames:merge&s=merge#merge_multiple_data.frames – baptiste Aug 06 '13 at 20:42
  • 1 million rows, 7 columns isn't *that* much. @flodel 's `do.call(rbind, df_list)` should do the trick... – Simon O'Hanlon Aug 06 '13 at 20:44
  • `rbind` may do the trick, I'll try that. Thanks! – argoneus Aug 06 '13 at 20:44
  • You could try `rbindlist` from `data.table` instead of `rbind` (although there are some things you'll have to take care of manually regarding data types). – Arun Aug 06 '13 at 20:51
  • If you are sure `rbind` is enough, check [here](http://stackoverflow.com/questions/18003717/is-there-any-efficient-way-than-rbind-filllist/18063131#18063131) for speed ups. `data.table::rbindlist` is probably what you need. – Xachriel Aug 06 '13 at 20:53
  • @Xachriel, @Arun thanks, I'll check out the parallelized `rbindlist`. – argoneus Aug 06 '13 at 21:02
  • `rbind` did the trick, and finished in 606s! @flodel I'll accept your answer if you submit it. Thanks for all of the comments and suggestions everyone – argoneus Aug 06 '13 at 21:09
  • You should check that rbinding really _is_ what you want, as opposed to merging. Otherwise, you'll have taken 606 seconds to obtain nonsense. – Hong Ooi Aug 06 '13 at 21:14
  • @HongOoi yes, `merge` and `rbind` produce the same output in this case – argoneus Aug 07 '13 at 01:25
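
For reference, a minimal sketch of the row-binding approach suggested in the comments above. It assumes, as in the question, that all data frames in `df_list` share the same seven columns; `big` is a hypothetical result name:

# Base R: stack all data frames row-wise
# (requires identical column names across data frames)
big <- do.call(rbind, df_list)

# Usually much faster: data.table::rbindlist
# use.names/fill match columns by name and tolerate missing ones;
# as noted in the comments, check data types (e.g. factor levels) afterwards
library(data.table)
big <- as.data.frame(rbindlist(df_list, use.names = TRUE, fill = TRUE))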

1 Answer


If you really need merge and not just rbind, you could first merge the data frames two by two (1 with 2, 3 with 4, 5 with 6, etc.), then merge the resulting data frames two by two, and so on, until only one data frame remains. Unlike Reduce, which re-merges the ever-growing accumulated result at every step, this keeps the intermediate results small: each data frame takes part in only about log2(n) merges.

# One step: merge the list elements pairwise (1 with 2, 3 with 4, ...);
# an unpaired last element is carried over unchanged
merge_some <- function(l, ...) {
  n <- length(l)
  k <- floor(n / 2)
  result <- vector("list", k)
  for (i in seq_len(k)) {
    result[[i]] <- merge(l[[2 * i - 1]], l[[2 * i]], ...)
  }
  if (2 * k < n) {
    result[[k + 1]] <- l[[n]]
  }
  result
}

# Sample data: 1000 small data frames sharing an 'id' key,
# each with one uniquely named value column
d <- lapply(1:1000, function(i) {
  r <- data.frame(id = sample(1:100,3), v = rnorm(3))
  names(r)[[2]] <- paste0("v",i)
  r
} )

# Iterate until there is only one data.frame left
while( length(d) > 1 ) {
  d <- merge_some(d, by="id", all=TRUE)
} 

# Result
head(d[[1]])
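
Applied to the original problem, the same loop would work directly. A sketch, assuming the question's `df_list` and guessing that the three factor columns are the merge keys (the question doesn't say which columns to join on, so adjust `by` accordingly):

# Hypothetical: pairwise-merge the 40K data frames down to one;
# the key columns x1:x3 are an assumption, not stated in the question
d <- df_list
while (length(d) > 1) {
  d <- merge_some(d, by = c("x1", "x2", "x3"), all = TRUE)
}
result <- d[[1]]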
Vincent Zoonekynd