
I need to analyse a group of clients, say 2,783 of them. I have an R script written for a generic client, and all the data the program needs to calculate the different variables sits in a database linked to the workspace. Within a single client, the code must run sequentially, since many dependent variables build on each other. Each client takes about 1 minute to run. My computer has 8 logical processors, and R uses only 1 unless run in parallel.

The issue I haven't found an answer to on the internet yet is how to dispatch, via a batch file: client 1 to the first processor, ..., client 8 to the eighth processor. Only when a processor is done should it write a log file with some specifics about the run and move on to the next client; for example, when processor 1 finishes with client 1, it moves on to client 9 (since the other 7 processors have already started on clients 2 to 8 of the batch list).

A given processor must start and finish the client it has picked up, for the reasons mentioned above. And each week I have a similar number of clients to analyse.

So it is a problem of batch-processing R code in parallel to maximise the computer's processing power.

At this rate, running the full extract of about 2,800 clients would take almost 2 days around the clock! Using all 8 cores would cut that time by roughly 88% to approximately 6 hours, and with batch processing, even if it takes 6 hours, those are 6 hours in which I can focus on other work.

Thanks in advance!
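The load-balancing behaviour described above (a worker starts and finishes one client, writes a log, then picks up the next unprocessed one) can be sketched in base R with the built-in `parallel` package. This is only a sketch: `run_client()` and the sourced script name `process_client.R` are hypothetical stand-ins for the real per-client analysis, and the demo uses 16 clients and a short sleep instead of the real 2,783 one-minute jobs.

```r
## Sketch only: run_client() and "process_client.R" are hypothetical
## placeholders for the real per-client analysis script.
library(parallel)

client_ids <- 1:16                       # the real run would use 1:2783

run_client <- function(id) {
  t0 <- Sys.time()
  # source("process_client.R")           # hypothetical per-client script
  Sys.sleep(0.1)                         # stand-in for the ~1 minute job
  elapsed <- as.numeric(difftime(Sys.time(), t0, units = "secs"))
  # write one small log file per finished client
  log_file <- file.path(tempdir(), sprintf("client_%04d.log", id))
  cat(sprintf("client %d finished in %.2f s\n", id, elapsed), file = log_file)
  elapsed
}

cl <- makeCluster(8)                     # one worker per logical processor
# Load-balanced variant: each worker fully processes one client, then
# immediately picks up the next unprocessed ID (so client 9 goes to
# whichever worker frees up first).
timings <- parLapplyLB(cl, client_ids, run_client)
stopCluster(cl)

length(timings)                          # one result per client
```

`parLapplyLB` does the dynamic scheduling; with `parLapply` the IDs would instead be split into fixed chunks up front, which wastes time when some clients run longer than others.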

  • You might want to take a look at [`future.apply`](https://www.r-bloggers.com/future-apply-parallelize-any-base-r-apply-function/) if you're using the apply family of functions, or [`furrr`](https://github.com/DavisVaughan/furrr) if you use the tidyverse. – Dom Jul 31 '20 at 12:25
  • The first link redirects to R bloggers, but not to a specific entry to an issue like this. The second link is broken. – juan diluca haltrich Jul 31 '20 at 12:51

1 Answer


As mentioned in the comments, `furrr` is a practical choice: it builds on the tidyverse's `purrr` and adds the multiprocessing capabilities of `future`:

library(furrr)
library(dplyr)
plan(multisession, workers = 3)
nbrOfWorkers()
#[1] 3

clients <- as.list(1:9)
system.time(
  results <- clients %>% future_map(~ {
    client <- .x
    cat("processing client", client, "\n")
    # Long processing
    # source('ScripttoProcess.R')
    Sys.sleep(5)
    # Save results
    paste("Client", client, "results")
  }, .progress = TRUE)
)

Progress: ──────────────────────────────────────────────────────────────── 100%

processing client 1 
processing client 2 
processing client 3 
processing client 4 
processing client 5 
processing client 6 
processing client 7 
processing client 8 
processing client 9 

       User      System       Total 
       0.09        0.00       15.47 
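To launch this from a batch file, as asked in the question, one common pattern is a one-line `.bat` that calls `Rscript` and redirects all console output to a log file. The script name `run_all_clients.R` and the `Rscript.exe` path are assumptions; adjust them to your R installation.

```shell
@echo off
REM Hypothetical batch file; adjust the path to Rscript.exe for your install,
REM or drop the full path if Rscript is already on the PATH.
"C:\Program Files\R\R-4.0.2\bin\Rscript.exe" run_all_clients.R > run_all_clients.out 2>&1
```

Scheduling this `.bat` with Windows Task Scheduler would cover the weekly re-run mentioned in the question.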
Waldi
  • Thanks for the reply, although I don't think it completely resolves the issue. First of all, I don't use the tidyverse, so could you explain the last part of your code? How do you check that client 1 was run on logical processor 1, client 2 on processor 2, and so on, up to client 9 on processor 1 again? Moreover, this doesn't address the issue of sending this code through a batch file, does it? – juan diluca haltrich Jul 31 '20 at 13:13
  • See my edit: you just need to source the processing script inside the map function and return its results at the end; all the results are returned as a list. Each processor will fully process one client and move to the next one when the script is finished. – Waldi Jul 31 '20 at 13:23
  • @juan, did you find a solution to your question? – Waldi Aug 12 '20 at 14:05
  • Hello, not really. I ended up migrating the work to C, since (a) the program will be faster and (b) implementing a parallel-processing environment is much better suited to a language like C than to R. At least there are far more options for implementing it. – juan diluca haltrich Sep 23 '20 at 20:41
  • I appreciate your reply, but I don't like that style of R programming, with pipes and that library in general; it is higher-level and thus less speed-efficient. When programming in R I stick to base R. But even then, I moved the code to C, since R itself was ultimately not fast enough. I can't afford the whole program to run for more than 2 hours. – juan diluca haltrich Sep 23 '20 at 20:44
  • Thanks for your feedback; that's your choice ;). I like functional programming combined with pipes because I find it short and elegant. When parallelizing is needed, I doubt that the overhead of `furrr` makes a difference against the 1 minute you need to process 1 client. Going to C can of course [make a big difference](https://stackoverflow.com/a/62841712/13513328)! – Waldi Sep 23 '20 at 21:06