
I have a list with two data.table objects in it. To give an idea, one table has 400,000 rows & 7 variables, the other 750,000 rows & 12 variables. The two tables don't have the same columns. I do a lot of munging on them (different steps for each). The munging steps include calculating sums, finding the percentile of a summary value, counting the number of cases in each group, taking the unique length, etc. (more than 20 steps on each). I use the data.table package for these steps. However, doing all ~20 steps on each table (>40 steps in total) takes a fair amount of time, and I am wondering how I can use parallel processing to speed this up. I assume it is possible to process these steps in parallel, as they are carried out on different components of a list. I did a thorough Google search to brainstorm ideas; however, I didn't find any helpful leads. Has anyone done this? Please shed some light; I would be very grateful. Thank you.

So far, I have done this much. `result` is the list containing the two data.table objects, and `fun1` & `fun2` are the wrapped-up sets of steps I need to run on each data.table object. Performance-wise, I don't see any gain yet (probably due to overhead? I don't know).

munge_data <- function(fun1 = prep_data1, fun2 = prep_data2, result = result) {
  library(foreach)
  library(doParallel)
  cl <- makeCluster(2)
  registerDoParallel(cl)
  on.exit(stopCluster(cl))  # stop the cluster even if a worker errors

  # One munging function per data.table in `result`; `foreach` exports
  # `fun_munge` and `result` to the workers automatically
  fun_munge <- list(fun1, fun2)

  finatest <- foreach(i = 1:2, .packages = "data.table") %dopar% {
    fun_munge[[i]](result)
  }
  finatest
}
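For what it's worth, the same pairing of tables and transformations can be written with `parallel::clusterMap`, which also works on Windows. `prep1`, `prep2`, and the toy tables below are illustrative placeholders for the real munging functions and data, not the actual steps:

```r
library(parallel)
library(data.table)

# Placeholder munging functions standing in for the real ~20-step pipelines
prep1 <- function(dt) dt[, .(total = sum(x)), by = grp]
prep2 <- function(dt) dt[, .(n = .N), by = grp]

# Placeholder list of two data.tables
result <- list(
  data.table(grp = c("a", "a", "b"), x = 1:3),
  data.table(grp = c("a", "b", "b"), x = 4:6)
)
funs <- list(prep1, prep2)

cl <- makeCluster(2)
clusterEvalQ(cl, library(data.table))  # load data.table on each worker
# Pair each table with its function and run both on separate workers
out <- clusterMap(cl, function(f, dt) f(dt), funs, result)
stopCluster(cl)
```

Whether this beats the sequential version depends on how expensive the munging is relative to the cost of shipping each data.table to a worker and back.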
tmthydvnprt
JeanVuda
  • IMO not a duplicate, as the OP is asking about parallel processing on multiple separate data.tables; parallel should work then. BTW, a minimal example would be nice. – jangorecki Jul 03 '15 at 07:19
  • `parallel` + `mclapply` ? – Arun Jul 03 '15 at 09:41
  • Yeah. I couldn't come up with a way to crack this. If anyone knows something, please share. – JeanVuda Jul 03 '15 at 09:46
  • @jangorecki [and Arun] Yeah, I have two data.table objects and a set of separate munging steps for each object. There is no need to process them one after the other. I am wondering how to apply parallel processing here. – JeanVuda Jul 04 '15 at 06:49
  • @JeanVuda make a list `a` of data.tables and a list `b` of the corresponding transformations to apply (functions or unevaluated expressions). Provide them to `mcmapply`. – jangorecki Jul 04 '15 at 11:54
  • You may not see a gain because the transformations using data.table are blazingly fast; setting up workers and collecting results may slow the process. If you don't need all the data together, you may run multiple R sessions from the shell and process them in parallel but separately. – jangorecki Jul 04 '15 at 12:00
  • @jangorecki, yeah, I am following the same technique. My list is `result`, and the transformations are in the `fun_munge` list. I used the `foreach` package's `%dopar%`. I will test `mcmapply`. Does it work on Windows? – JeanVuda Jul 04 '15 at 12:17
  • 1
    @JeanVuda I think it not work or at least not work out of the box, I've seen a post how to make `parallel` works on windows but `foreach` seems to be much easier. – jangorecki Jul 04 '15 at 19:03
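A minimal sketch of the `mcmapply` approach suggested in the comments above. It parallelizes by forking, so it runs in parallel only on Linux/macOS (on Windows, `mc.cores` must be 1 and it degrades to serial). `prep1`, `prep2`, and the toy tables are placeholders for the real munging functions and data:

```r
library(parallel)
library(data.table)

# Placeholder munging functions standing in for the real pipelines
prep1 <- function(dt) dt[, .(total = sum(x)), by = grp]
prep2 <- function(dt) dt[, .(n = .N), by = grp]

# Placeholder list of two data.tables
result <- list(
  data.table(grp = c("a", "a", "b"), x = 1:3),
  data.table(grp = c("a", "b", "b"), x = 4:6)
)
funs <- list(prep1, prep2)

# Apply funs[[i]] to result[[i]] in parallel via forked workers;
# on Windows set mc.cores = 1 (serial fallback)
out <- mcmapply(function(f, dt) f(dt), funs, result,
                SIMPLIFY = FALSE, mc.cores = 2)
```

Because forked workers share memory with the parent via copy-on-write, this avoids the serialization cost that a PSOCK cluster (as in the `foreach`/`doParallel` attempt) pays for each data.table.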

0 Answers