
I am currently using the following code to merge >130 data frames, and it takes too many hours to run (I have actually never seen it complete on the full dataset, only on subsets). Each table has two columns: unit (string) and counts (integer). I am merging by unit.

tables <- lapply(files, function(x) read.table(x, col.names = c("unit", x)))

MyMerge <- function(x, y){
  df <- merge(x, y, by="unit", all.x= TRUE, all.y= TRUE)
  return(df)
}

data <- Reduce(MyMerge, tables)

Is there any way to speed this up easily? Each table/data frame on its own has around 500,000 rows, and many of those rows are unique to that table, so merging multiple tables quickly pushes the merged data frame to many millions of rows.

At the end, I will drop rows with too low summary counts from the big merged table, but I don't want to do that during merging, since the order of my files would then matter.
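For reference, the post-merge filtering I have in mind looks roughly like this (the threshold is just a placeholder):

min_total <- 10  # placeholder threshold for "too low summary counts"
count_cols <- setdiff(names(data), "unit")           # all count columns of the merged table
totals <- rowSums(data[, count_cols, drop = FALSE], na.rm = TRUE)
data <- data[totals >= min_total, ]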

Perlnika
  • Look at http://stackoverflow.com/questions/1299871/how-to-join-merge-data-frames-inner-outer-left-right for some `data.table` solutions (outer joins). This will probably be faster. – coffeinjunky May 31 '16 at 16:54
  • Thanks @coffeinjunky. I tried the dplyr package based on the above-mentioned thread, but unfortunately that was slower in my case. – Perlnika May 31 '16 at 17:38
  • @coffeinjunky for one of the smaller datasets, I was able to get from 77 seconds down to 66 seconds. Not magical, but definitely helpful :) – Perlnika May 31 '16 at 18:12
  • Do you need to store the results of which table each number came from? Or are you going to aggregate it at the end? – Gregor Thomas May 31 '16 at 19:55

1 Answer


Here is a small comparison, first with a rather small dataset, then with a larger one:

library(data.table)
library(plyr)
library(dplyr)
library(microbenchmark)

# sample size: 
n = 4e3

# create some data.frames:
df_list <- lapply(1:100, function(x) {
  out <- data.frame(id = 1:n,
                    type = sample(c("coffee", "americano", "espresso"), n, replace = TRUE))
  names(out)[2] <- paste0(names(out)[2], x)
  out})

# transform dfs into data.tables:
dt_list <- lapply(df_list, function(x) {
  out <- as.data.table(x)
  setkey(out, "id")
  out
})

# set all methods to perform a full outer join by "id":
mymerge <- function(...) base::merge(..., by = "id", all = TRUE)
mydplyr <- function(...) dplyr::full_join(..., by = "id")
myplyr  <- function(...) plyr::join(..., by = "id", type = "full")
mydt    <- function(...) merge(..., by = "id", all = TRUE)

# Compare:
microbenchmark(base = Reduce(mymerge, df_list),
               dplyr= Reduce(mydplyr, df_list),
               plyr = Reduce(myplyr, df_list),
               dt = Reduce(mydt, dt_list), times=50)

This gives the following results:

Unit: milliseconds
  expr       min        lq      mean    median        uq       max neval  cld
  base  944.0048  956.9049  974.8875  962.9884  977.6824 1221.5301    50   c 
 dplyr  316.5211  322.2476  329.6281  326.9907  332.6721  381.6222    50 a   
  plyr 2682.9981 2754.3139 2788.7470 2773.8958 2812.5717 3003.2481    50    d
    dt  537.2613  554.3957  570.8851  560.5323  572.5592  757.6631    50  b  

We can see that the two leading contenders are dplyr and data.table. Changing the sample size to 5e5 yields the following comparison, showing that data.table indeed dominates at this scale. Note that I added this part after @BenBolker's suggestion.

microbenchmark(dplyr= Reduce(mydplyr, df_list),
               dt = Reduce(mydt, dt_list), times=50)

Unit: seconds
expr      min       lq     mean   median       uq      max neval cld
dplyr 34.48993 34.85559 35.29132 35.11741 35.66051 36.66748    50   b
   dt 10.89544 11.32318 11.61326 11.54414 11.87338 12.77235    50  a 
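
Transferred to your original problem, a keyed data.table version of the merging step could look roughly like this. This is only a sketch: it assumes `files` holds the file paths and that each file has the two columns described in the question, so adjust the fread call to your format.

library(data.table)

# read each file as a data.table keyed on "unit"
# (the fread arguments are assumptions about the file format)
dt_list <- lapply(files, function(f) {
  dt <- fread(f, col.names = c("unit", basename(f)))
  setkey(dt, unit)
  dt
})

# full outer join on "unit" across all tables
data <- Reduce(function(x, y) merge(x, y, by = "unit", all = TRUE), dt_list)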
coffeinjunky