
I'm running a nested loop with %dopar% to generate a dummy dataset for experimentation purposes. Reference: R nested foreach %dopar% in outer loop and %do% in inner loop

Sample dataset

set.seed(123)
n = 10000 # number of unique IDs (10k as a trial); the real data consists of 50k unique IDs
ID <- paste(LETTERS[1:8],sample(n),sep = "")
year <- c('2015','2016','2017','2018')
month <- c('1','2','3','4','5','6','7','8','9','10','11','12')

Pre-defined libraries

library(foreach)  
library(data.table)
library(doParallel)

# parallel processing setting
cl <- makeCluster(detectCores() - 1)
registerDoParallel(cl)

Test 1: %dopar% script

system.time(
  output_table <- foreach(i = seq_along(ID), .combine=rbind, .packages="data.table") %:%
    foreach(j = seq_along(year), .combine=rbind, .packages="data.table") %:%
    foreach(k = seq_along(month), .combine=rbind, .packages="data.table") %dopar% {

    data.table::data.table(
      mbr_code = ID[i],
      year = year[j],
      month = month[k]
    )
  }
)
stopCluster(cl)

#---------#
# runtime #
#---------#
>    user  system elapsed 
> 1043.31   66.83 1171.08

Test 2: %do% script

system.time(
  output_table <- foreach(i = seq_along(ID), .combine=rbind, .packages="data.table") %:%
    foreach(j = seq_along(year), .combine=rbind, .packages="data.table") %:%
    foreach(k = seq_along(month), .combine=rbind, .packages="data.table") %do% {

    data.table::data.table(
      mbr_code = ID[i],
      year = year[j],
      month = month[k]
    )
  }
)
# (no stopCluster() here: the cluster was already stopped after Test 1, and %do% runs sequentially)

#---------#
# runtime #
#---------#
> user  system elapsed 
> 1101.85    1.02 1110.55 

Expected output results

> View(output_table)

[screenshot of the expected output_table omitted]

Problem

When I run the %dopar% version, I monitored my machine's CPU performance using Resource Monitor and noticed the CPUs are not fully utilised. [Resource Monitor screenshot omitted]

Question

I ran both scripts above (Test 1 and Test 2) on my machine (i5, 4 cores), but the run times for %do% and %dopar% are close to each other. Is it a design issue in my script? My real data consists of 50k unique IDs, which would take a very long time with %do%. How can I fully utilise my machine's CPUs to reduce the run time?

yc.koong
  • Related: https://stackoverflow.com/questions/7224938/can-rbind-be-parallelized-in-r – Hong Ooi Feb 19 '19 at 07:40
  • It does look like you have things running on all CPUs, could that be a limit imposed by Windows on each of them? – Nakx Feb 19 '19 at 07:42

1 Answer


I believe you are seeing the initial overhead of the foreach package, as it copies and sets up whatever is needed to run each of the loops correctly. After running your code for about 30-60 seconds, my CPUs all bumped to full utilization until the code was finally done.

That said, this does not explain why your code is so slow compared to the %do% loop. I believe the culprit here is how the foreach loop is applied when you are trying to access data across all foreach loops. Basically, if you don't .export the data you need, it will try to access the same data in several of the parallel sessions, and each session will have to wait while the other sessions finish accessing their own data. This could likely be alleviated by exporting the data using the .export argument of foreach. Personally I use other packages for most of my parallelization, so I suggest testing this if it is what you want. It would come with greater overhead, however.
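For illustration (not part of the original answer), here is a minimal sketch of passing the vectors explicitly via .export, assuming the same cl, ID, year and month objects defined in the question; foreach may already export these automatically, in which case the explicit argument is simply redundant:

output_table <- foreach(i = seq_along(ID), .combine = rbind,
                        .packages = "data.table",
                        .export = c("ID", "year", "month")) %:%
  foreach(j = seq_along(year), .combine = rbind) %:%
  foreach(k = seq_along(month), .combine = rbind) %dopar% {
    # each worker now has its own copy of ID, year and month
    data.table(mbr_code = ID[i], year = year[j], month = month[k])
  }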

Faster methods:

Now, as you are trying to create a dummy dataset in which all combinations of certain columns are combined, there are far faster methods of achieving this. A quick search for 'cross join' will lead you to posts like this one.

With the data.table package this can be done extremely efficiently using the CJ function. Simply

output <- CJ(ID, year, month)

will yield the result your nested loops are trying to create, using only about 0.07 seconds to perform the task.
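As a small addition (not in the original answer): CJ also accepts named arguments, so the columns can carry the same names as in the loop output:

library(data.table)
# named arguments become the column names: mbr_code, year, month
output <- CJ(mbr_code = ID, year = year, month = month)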

Oliver
  • Hi @Oliver, thanks for the comment! Applying your suggested `CJ` function works very well for my cross-join issue. You are right, I had an overhead issue, especially when running parallel processing involving `rbind`, and I'm glad you pointed out an important factor (the missing `.export` statement) that led to the overhead issue. :) – yc.koong Feb 19 '19 at 10:14
  • Hi @Oliver, may I know which parallelization package you normally use? – yc.koong Feb 19 '19 at 10:16
  • No problem, I am glad I could help. For my own programming I most often use the parallel package by itself. Many problems can be solved with parApply or parLapply (if the output is not matrix-like). In any situation, however, one benefits greatly if the problem can be split into several chunks and the parallelization done over those chunks, instead of having to iterate over many smaller problems. Each iteration adds a (in most cases tiny) bit of overhead, so having 4 processes performing all the work is better than 100 doing the same work in smaller chunks (a minimal sketch of this chunking idea follows these comments). – Oliver Feb 19 '19 at 19:01
  • For non-standard parallel processes, the future (or furrr), promises and ipc packages can help with asynchronous parallelization, although it becomes even more technical and requires some time studying the problem. I believe foreach can utilize the chunk-like structure of the iterators package to perform calculations in chunks of size equal to the number of cores or fewer, instead of iterating over many small elements. However, I have yet to investigate this property, which might make the foreach package faster in some non-standard situations. – Oliver Feb 19 '19 at 19:03
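A minimal sketch (added here for illustration, not from the original comments) of the chunking idea described above, using the base parallel package: split the IDs into one chunk per worker and let each worker build its share of the table with CJ:

library(parallel)
library(data.table)

cl <- makeCluster(detectCores() - 1)
clusterEvalQ(cl, library(data.table))  # make CJ available on each worker

# one chunk of IDs per worker
id_chunks <- split(ID, cut(seq_along(ID), length(cl), labels = FALSE))

chunk_results <- parLapply(cl, id_chunks, function(ids, year, month) {
  CJ(mbr_code = ids, year = year, month = month)
}, year = year, month = month)

output_table <- rbindlist(chunk_results)
stopCluster(cl)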