I have a list with two data.table objects in it. To give an idea, one table has 400,000 rows & 7 variables, the other has 750,000 rows & 12 variables. The two tables don't have the same columns. I do a lot of munging on them (different steps for each). The munging steps include calculating sums, finding the percentile of a summary value, counting the cases in each group, getting the number of unique values, etc. (more than 20 steps on each). I use the data.table
package for these steps. However, doing all ~20 steps on each table (>40 steps in total) takes quite a bit of time. I am wondering how I can use parallel processing to speed this up. I assume it is possible to run these steps in parallel, since they are carried out on different components of a list. I did a thorough Google search to brainstorm ideas, but I didn't find any helpful leads. Has anyone done this? Please shed some light; I would be very grateful. Thank you.
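
To give a concrete idea, an individual step looks roughly like this (grp and val are placeholder column names, not my real ones):

library(data.table)

dt <- data.table(grp = sample(letters[1:5], 1e5, replace = TRUE),
                 val = rnorm(1e5))

dt[, .(total  = sum(val),            # sum
       p90    = quantile(val, 0.9),  # percentile of a summary value
       n      = .N,                  # number of cases in each group
       n_uniq = uniqueN(val)),       # unique length
   by = grp]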
So far, I have done this much. result
is the list containing the two data.table objects. fun1 & fun2 are the wrapped-up sets of steps I need to run on each data.table object. Performance-wise I don't see any gain yet (probably due to overhead? I don't know).
munge_data <- function(fun1 = prep_data1, fun2 = prep_data2, result = result) {
  library(foreach)
  library(doParallel)

  # one worker per data.table
  cl <- makeCluster(2)
  registerDoParallel(cl)

  # store each call as an unevaluated expression; eval() runs it on a worker
  fun_munge <- list(quote(prep_data1(result)), quote(prep_data2(result)))

  finatest <- foreach(i = 1:2, .packages = "data.table") %dopar% {
    # referencing these objects makes foreach export them to the workers
    result     <- result
    prep_data1 <- fun1
    prep_data2 <- fun2
    eval(fun_munge[[i]])
  }

  stopCluster(cl)
  finatest
}
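
For what it's worth, the same structure can be written more directly by mapping over a list of functions with parallel::parLapply, so nothing has to be quoted and eval()-ed. This is only a sketch of the shape I have in mind, not tested code (it assumes prep_data1/prep_data2 each take the whole result list, as in my function above):

library(parallel)

munge_data2 <- function(fun1, fun2, result) {
  cl <- makeCluster(2)                    # one worker per data.table
  on.exit(stopCluster(cl))                # always tear the cluster down
  clusterEvalQ(cl, library(data.table))   # load data.table on each worker

  # ship one function to each worker and apply it to result there
  parLapply(cl, list(fun1, fun2), function(f, res) f(res), res = result)
}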