I'm partitioning a data frame with split() in order to use parLapply() to call a function on each partition in parallel. The data frame has 1.3 million rows and 20 columns. I'm splitting/partitioning by two columns, both character type. There are ~47K unique IDs and ~12K unique codes, but not every pairing of ID and code is matched; the resulting number of partitions is ~250K. Here is the split() line:
system.time(pop_part <- split(pop, list(pop$ID, pop$code)))
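For context, here is a tiny self-contained illustration (toy data, not my real columns) of what I understand split() to do with a list of two factors: it forms every combination of levels, and only drop = TRUE restricts the result to combinations that actually occur. If that understanding is right, the ~47K x ~12K cross-product may be part of why my call is so slow, though I'm not certain:

# Toy illustration (made-up data): split() on a list of two grouping columns
# builds the full cross-product of levels unless drop = TRUE.
toy <- data.frame(ID   = c("a", "a", "b"),
                  code = c("x", "y", "x"),
                  val  = 1:3,
                  stringsAsFactors = FALSE)

length(split(toy, list(toy$ID, toy$code)))               # 4 partitions: a.x, b.x, a.y, b.y
length(split(toy, list(toy$ID, toy$code), drop = TRUE))  # 3 partitions: only observed pairs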
The partitions will then be fed into parLapply() as follows:
cl <- makeCluster(detectCores())
system.time(par_pop <- parLapply(cl, pop_part, func))
stopCluster(cl)
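In case it matters for suggestions, here is a minimal self-contained sketch of the same pattern with toy data and a stand-in for func (my real function is more involved):

# Minimal sketch of the parLapply pattern above; toy_parts and toy_func are
# stand-ins. Any globals that func relies on would need clusterExport().
library(parallel)

toy_parts <- split(mtcars, mtcars$cyl)   # stand-in for pop_part
toy_func  <- function(df) nrow(df)       # stand-in for func

cl <- makeCluster(detectCores())
# clusterExport(cl, "some_global")       # only if func references globals
res <- parLapply(cl, toy_parts, toy_func)
stopCluster(cl)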
I've let the split() code alone run for almost an hour and it doesn't complete. Splitting by ID alone does complete, in ~10 minutes. Additionally, RStudio and the worker threads are consuming ~6GB of RAM.
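For reference, the ID-only variant that does complete looks like this (only the variable name differs):

# Splitting on ID alone finishes in roughly 10 minutes on my machine:
system.time(pop_part_id <- split(pop, pop$ID))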
The reason I know the resulting number of partitions is that I have equivalent code in Pentaho Data Integration (PDI) that runs in 30 seconds (for the entire program, not just the "split" step). I'm not hoping for that kind of performance from R, but something that completes in 10-15 minutes worst case would be acceptable.
The main question: is there a better alternative to split()? I've also tried ddply() with .parallel = TRUE, but it also ran for over an hour and never completed.
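For completeness, the ddply() attempt looked roughly like this (from memory, so the backend registration details may not be exactly what I ran):

# Roughly what I tried with plyr; a parallel backend has to be registered
# (I used doParallel) for .parallel = TRUE to have any effect.
library(plyr)
library(doParallel)

registerDoParallel(cores = parallel::detectCores())
system.time(par_pop <- ddply(pop, .(ID, code), func, .parallel = TRUE))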