I have ~40K data frames in a list. Each data frame has 7 variables: 3 factors and 4 numeric. For reference, here is the str() output for the first one:
$ a:'data.frame': 4 obs. of 7 variables:
..$ x1 : Factor w/ 1 level "a": 1 1 1 1
..$ x2 : Factor w/ 4 levels "12345678901234",..: 1 2 3 4
..$ x3 : Factor w/ 4 levels "SAMPLE",..: 1 2 3 4
..$ x4 : int [1:4] 1 2 3 4
..$ x5 : num [1:4] 10 20 30 40
..$ x6 : int [1:4] 50 60 70 80
..$ x7 : num [1:4] 0.5 0.7 0.35 1
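For a reproducible sketch, something like this builds a list with the same shape (make_df is a hypothetical helper and the values are made up; only the structure matters):

make_df <- function(i) {
  data.frame(
    x1 = factor("a"),
    x2 = factor(sprintf("%014d", i + 1:4)),  # 14-character labels
    x3 = factor(c("SAMPLE", "S2", "S3", "S4")),
    x4 = 1:4,
    x5 = c(10, 20, 30, 40),
    x6 = c(50L, 60L, 70L, 80L),
    x7 = c(0.5, 0.7, 0.35, 1)
  )
}
df_list <- lapply(seq_len(40000), make_df)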
I'm trying to merge these into a single ginormous data frame, using:
Reduce(function(...) merge(..., all = TRUE), df_list)
As recommended here: Simultaneously merge multiple data.frames in a list.
If I take the first 1000 items, i.e.
Reduce(function(...) merge(..., all = TRUE), df_list[1:1000])
This produces the desired result (merges the individual data frames into a single one) and completes in 37 seconds.
However, running Reduce() on the entire 40K list of data frames takes an inordinate amount of time, presumably because each successive merge operates on the ever-growing accumulated result. I've let it run for more than 5 hours and it doesn't appear to complete.
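A sketch of how that growth could be checked, assuming df_list as above (exact timings will obviously vary by machine):

for (n in c(250, 500, 1000, 2000)) {
  elapsed <- system.time(
    Reduce(function(...) merge(..., all = TRUE), df_list[1:n])
  )["elapsed"]
  cat(n, "frames:", elapsed, "seconds\n")
}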
Are there any tricks I can use to improve the performance of Reduce(), or is there a better alternative?
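In case it's relevant: since every data frame has the same 7 columns, I believe merging with all = TRUE over the full column set should be roughly equivalent to stacking all the rows and dropping duplicates (assuming rows are unique within each individual frame). A sketch of that alternative with data.table, which I haven't benchmarked at full scale:

library(data.table)

big <- rbindlist(df_list)  # stack all 40K frames in one pass
big <- unique(big)         # drop duplicate rows across frames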