34

As I sit here waiting for some R scripts to run... I was wondering: is there any way to parallelize rbind in R?

I frequently find myself waiting for this call to complete, since I deal with large amounts of data.

do.call("rbind", LIST)
Atlas1j
  • 2,442
  • 3
  • 27
  • 35
  • 5
    [rbind.fill](http://rgm2.lab.nig.ac.jp/RGM2/func.php?rd_id=plyr:rbind.fill) from the plyr package advertises that it operates substantially faster than base `rbind`. Maybe you'll see some performance bumps there. If you add some representative sample data, people can offer other solutions as well, with benchmarks to compare timings. – Chase Aug 29 '11 at 00:39
  • 3
    What type of objects are in `LIST` (matrix, data.frame, etc.)? – Joshua Ulrich Aug 29 '11 at 02:17
  • What type of objects are in LIST? Data.frames. – Atlas1j Aug 30 '11 at 15:45

6 Answers

24

I haven't found a way to do this in parallel either so far. However, for my dataset (a list of about 1,500 data frames totaling 4.5M rows) the following snippet seemed to help:

# Repeatedly bind the data frames in pairs, halving the list on each pass
while(length(lst) > 1) {
    # indices of the first element of each pair
    idxlst <- seq(from=1, to=length(lst), by=2)

    lst <- lapply(idxlst, function(i) {
        # an odd element left over at the end is passed through unchanged
        if(i==length(lst)) { return(lst[[i]]) }

        # bind this pair into a single data frame
        return(rbind(lst[[i]], lst[[i+1]]))
    })
}

where lst is the list. It seemed to be about 4 times faster than do.call(rbind, lst) or even do.call(rbind.fill, lst) (with rbind.fill from the plyr package). Each iteration halves the number of data frames, and when the loop ends the single combined data frame is left in lst[[1]].
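For reference, a minimal sketch wrapping the snippet into a reusable function that also returns the result (the name rbind.pairwise is mine, not from the answer):

rbind.pairwise <- function(lst) {
  while(length(lst) > 1) {
    idxlst <- seq(from=1, to=length(lst), by=2)
    lst <- lapply(idxlst, function(i)
      if(i==length(lst)) lst[[i]] else rbind(lst[[i]], lst[[i+1]]))
  }
  lst[[1]]  # the single combined data frame
}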

Dominik
  • 241
  • 3
  • 3
20

Because you said that you want to rbind data.frame objects, you should use the data.table package. It has a function called rbindlist that drastically speeds this up compared to rbind. I am not 100% sure, but I would bet that any use of rbind triggers a copy, while rbindlist does not. Anyway, a data.table is a data.frame, so you lose nothing by trying.

EDIT:

library(data.table)
system.time(dt <- rbindlist(pieces))
   user  system elapsed 
   0.12    0.00    0.13 
tables()
     NAME  NROW MB COLS                        KEY
[1,] dt   1,000 8  X1,X2,X3,X4,X5,X6,X7,X8,...    
Total: 8MB

Lightning fast...
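For context, `pieces` here is assumed to be a list of data.frames with identical columns; a minimal self-contained sketch that roughly matches the output above (the exact construction is my assumption, not part of the original answer):

library(data.table)
# hypothetical reconstruction: 1000 one-row, 1000-column data.frames
pieces <- lapply(seq(1000), function(.) data.frame(matrix(runif(1000), ncol=1000)))
dt <- rbindlist(pieces)   # returns a data.table, which is also a data.frame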

statquant
  • 13,672
  • 21
  • 91
  • 162
18

I doubt that you can get this to work faster by parallelizing it. Apart from the fact that you would probably have to write it yourself (thread one first rbinds items 1 and 2 while thread two rbinds items 3 and 4, and so on; when they're done, the results are 'rebound', something like that; I don't see a non-C way of improving this), it is going to involve copying large amounts of data between your threads, which is typically the thing that is slow in the first place.

In C, you can share objects between threads, so then you could have all your threads write in the same memory. I wish you the best of luck with that :-)

Finally, as an aside: rbinding data.frames is just slow. If you know up front that the structure of all your data.frames is exactly the same, and they don't contain pure character columns, you can probably use the trick from this answer to one of my questions. If your data.frames contain character columns, I suspect you're best off handling those separately (do.call(c, lapply(LIST, "[[", "myCharColName"))) and then performing the trick with the rest, after which you can reunite them.
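The linked trick isn't quoted here, but a minimal sketch of the column-wise idea for data.frames with identical, purely numeric structure (the function name bind_by_column is mine, for illustration only):

# bind column-by-column with unlist instead of rbind-ing rows;
# assumes every data.frame in lst has the same numeric columns
bind_by_column <- function(lst) {
  cols <- names(lst[[1]])
  out <- lapply(cols, function(nm) unlist(lapply(lst, `[[`, nm), use.names = FALSE))
  names(out) <- cols
  as.data.frame(out)
}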

Nick Sabbe
  • 11,684
  • 1
  • 43
  • 57
7

Here's a solution; it naturally extends to rbind.fill, merge, and other functions over lists of data frames:

But, as with all my answers/questions, verify it yourself :)

require(snowfall)
require(rbenchmark)

rbinder <- function(..., cores=NULL){
  if(is.null(cores)){
    # serial fallback: plain do.call rbind
    do.call("rbind", ...)
  }else{
    # cut points that divide the list into roughly equal chunks, one per core
    sequ <- as.integer(seq(1, length(...), length.out=cores+1))
    # build "list1 = ...[1:n1], list2 = ...[n1+1:n2], ..." as text and evaluate it
    listOLists <- paste(paste("list", seq(cores), sep=""), " = ...[",  c(1, sequ[2:cores]+1), ":", sequ[2:(cores+1)], "]", sep="", collapse=", ") 
    dfs <- eval(parse(text=paste("list(", listOLists, ")")))
    # rbind each chunk on its own worker, then rbind the partial results
    suppressMessages(sfInit(parallel=TRUE, cores))
    dfs <- sfLapply(dfs, function(x) do.call("rbind", x))
    suppressMessages(sfStop())
    do.call("rbind", dfs)   
  }
}

pieces <- lapply(seq(1000), function(.) data.frame(matrix(runif(1000), ncol=1000)))

benchmark(do.call("rbind", pieces), rbinder(pieces), rbinder(pieces, cores=4), replications = 10)

# With an Intel i5 3570K:
#                         test replications elapsed relative user.self sys.self user.child sys.child
# 1   do.call("rbind", pieces)           10  116.70    6.505    115.79     0.10         NA        NA
# 3 rbinder(pieces, cores = 4)           10   17.94    1.000      1.67     2.12         NA        NA
# 2            rbinder(pieces)           10  116.03    6.468    115.50     0.05         NA        NA
Xachriel
  • 313
  • 2
  • 7
  • @Arun, it's a mistake that I already fixed, but it seems that it didn't save. Also, you can easily expand this to `rbindlist` or other list-based techniques. I'm just using R's standard `rbind` for an easy proof of concept. – Xachriel Aug 03 '13 at 05:10
  • @Arun, that's not what I meant. I never argued that my parallelization solution would be the fastest. I just answered the question "can rbind be parallelized?" with "yes", and with the notion that you can use this approach with different functions. With the `rbindlist` solution it would be nice to see whether it improves with parallelization. Probably not, because loading libraries/functions onto the cores takes more time than just binding a few tens of thousands of frames. But what about 100k, 1M, 10M? I'm currently at my summer cottage with an old MacBook, so I cannot try this properly until Monday. – Xachriel Aug 03 '13 at 16:36
2

This is expanding on @Dominik's answer.

We can use mclapply from the parallel package to increase the speed further. Also, rbind.fill does a better job than rbind, so here's the improved code. NOTE: this will only work on Mac/Linux; mclapply relies on forking, which is not available on Windows. EDIT: if you want to see progress, uncomment the print(i) line, and make sure you run from a terminal, not from RStudio. Printing to RStudio from a parallel process kind of messes RStudio up.

library(parallel)
library(plyr)  # for rbind.fill

rbind.fill.parallel <- function(list){
  while(length(list) > 1) {
    # indices of the first element of each pair
    idxlst <- seq(from=1, to=length(list), by=2)

    # bind the pairs on separate cores
    list <- mclapply(idxlst, function(i) {
      #print(i) #uncomment this if you want to see progress
      if(i==length(list)) { return(list[[i]]) }
      return(rbind.fill(list[[i]], list[[i+1]]))
    })
  }
  # the loop leaves a single combined data frame in the list
  list[[1]]
}
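A minimal usage sketch (the example list is made up for illustration):

dfs <- replicate(100, data.frame(a = runif(10), b = runif(10)), simplify = FALSE)
combined <- rbind.fill.parallel(dfs)  # one 1000-row data frame, on Mac/Linux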
Nikhil
  • 99
  • 3
  • 10
1

Looks like this has already been well answered by a number of people, but if it comes up for someone, here is a version of a parallel rbind for non-data.table/data.frame-esque objects:

rbind.parallel <- function(list, ncore)
  {
  library(parallel)
  do.call.rbind <- function(x){do.call(rbind,x)}
  cl <- makeCluster(ncore)
  # assign list elements to cores in round-robin fashion
  list.split <- split(list, rep(1:ncore, length(list)+1)[1:length(list)])
  # rbind each chunk on its own worker
  list.join <- parLapply(cl, list.split, do.call.rbind)
  stopCluster(cl)
  # rbind the partial results
  list.out <- do.call(rbind, list.join)
  return(list.out)
  }

This works effectively on sf-type objects. For example, if you read in a list of shapefiles from a directory using lapply(., st_read), obviously rbind.fill and its variants are not going to work to join all the features.
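A hypothetical usage sketch (the directory path is made up; depending on your setup, the worker processes may also need sf loaded, e.g. via clusterEvalQ(cl, library(sf)) inside the function, so that rbind dispatches the sf method):

library(sf)
files <- list.files("shapefiles/", pattern = "\\.shp$", full.names = TRUE)
shapes <- lapply(files, st_read, quiet = TRUE)
combined <- rbind.parallel(shapes, ncore = 4)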

SeldomSeenSlim
  • 811
  • 9
  • 24