I have a long list of predictor combinations from a dataset, and I run a simple lm regression on each one. Because the list is so long, fitting all the models takes a long time. While searching, I came across the parallel package and started to understand mclapply, but then realized that it doesn't run in parallel on Windows. Then I came across future.apply::future_lapply.

This is the slowest part of my function:

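  # build one formula string per row of combinations
  # y_var is a character string naming the response variable
  fmla <- combinations %>% 
    apply(1, paste, collapse = " + ") %>% 
    gsub(pattern = " \\+ NA", replacement = "", x = .) %>% 
    paste(y_var, "~", .)
  # drop the intercept from every model
  fmla_nocons <- paste(fmla, "- 1")
  
  # run lm models
  model <- lapply(fmla_nocons, function(x) lm(x, data = df))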

My combinations object is basically a table that looks like this:

var 1       var 2      var 3 
variable 1 variable 2 variable 3 
variable 2 variable 3 variable 4 
...           ...      ...

The table is very long, so the first step turns each row into a formula of the form y ~ variable 1 + variable 2 + variable 3, and the second step uses lapply to fit an lm regression for each combination.
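
The two steps can be sketched in base R as follows. This is only an illustration under assumptions: the built-in mtcars data stands in for df, and the combinations matrix and the y_var name are hypothetical. It also drops NA entries before pasting rather than removing them with gsub.

```r
# Hypothetical stand-in for the real combinations table: one row per model,
# padded with NA where a combination uses fewer predictors.
combinations <- rbind(
  c("wt", "hp", "disp"),
  c("wt", "qsec", NA),
  c("hp", NA, NA)
)
y_var <- "mpg"  # hypothetical response name (a column of mtcars)

# Step 1: collapse each row into a no-intercept formula string,
# skipping the NA padding.
fmla_nocons <- apply(combinations, 1, function(vars) {
  paste(y_var, "~", paste(vars[!is.na(vars)], collapse = " + "), "- 1")
})

# Step 2: fit one lm per formula string (sequentially).
models <- lapply(fmla_nocons, function(f) lm(as.formula(f), data = mtcars))
```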

From what I have read, future_lapply can run this on multiple cores (correct me if I have misunderstood). Do I need to set up a cluster as with mclapply, or is it as simple as replacing lapply(data, function(x) lm(x, data = df)) with future_lapply(data, function(x) lm(x, data = df))?

Any feedback or input will be helpful and thanks for your time!

Michael

1 Answer

Full disclosure: I don't know anything about future.apply. That said, most high-capacity clusters run Linux, so mclapply will work just fine there (is that what you meant by cluster?).

Without your actual data I can't really test anything, but an alternative to both mclapply and future_lapply is the parLapply function, which is meant for your situation (i.e., you can't use a FORK cluster, only a PSOCK cluster).

Some code to point you in that direction:

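library(parallel)

# leave one core free for the rest of the system
ncpus <- max(1, detectCores() - 1, na.rm = TRUE)
cl <- makeCluster(ncpus, type = "PSOCK")

# PSOCK workers start as empty R sessions, so df is passed as an extra argument
out <- parLapply(cl, fmla_nocons, function(x, df) lm(as.formula(x), data = df), df = df)

# shut the workers down when finished
stopCluster(cl)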

And, given that you know how lapply/mclapply works, you'll know that parLapply does the same thing as mclapply but with a different back-end.
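
To make that concrete, here is a self-contained sketch that fits a few models with parLapply and checks them against plain lapply. The formulas, the fit_one helper, the two-worker cluster and the use of the built-in mtcars data are all assumptions for illustration, not the asker's actual setup.

```r
library(parallel)

# Hypothetical formula strings; mtcars stands in for the real df.
fmla_nocons <- c("mpg ~ wt + hp - 1", "mpg ~ wt + qsec - 1", "mpg ~ hp + disp - 1")

# Hypothetical helper: fit one model from a formula string.
fit_one <- function(f, df) lm(as.formula(f), data = df)

# Two PSOCK workers (this cluster type also works on Windows).
cl <- makeCluster(2, type = "PSOCK")

# Workers do not share the main session's memory, so df is sent as an argument.
# tryCatch(finally = ...) shuts the workers down even if a fit fails.
par_models <- tryCatch(
  parLapply(cl, fmla_nocons, fit_one, df = mtcars),
  finally = stopCluster(cl)
)

# Sequential reference fits for comparison.
seq_models <- lapply(fmla_nocons, fit_one, df = mtcars)

# TRUE if the parallel and sequential coefficients agree.
same_coefs <- isTRUE(all.equal(lapply(par_models, coef), lapply(seq_models, coef)))
```

For what it's worth, the future.apply package follows a similar pattern: after `future::plan(multisession)`, a call such as `future.apply::future_lapply(fmla_nocons, fit_one, df = mtcars)` should be a drop-in replacement for the lapply call, and multisession uses background R sessions, so it is not limited to Linux. That part is not tested here.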

John
  • Hi John, thanks for your reply! I am only just looking into mclapply and am not sure of the fundamentals. Can you elaborate on what a PSOCK cluster is? Does stopCluster mean stop the parallel processing? Thanks! – Michael Sep 18 '20 at 06:07
  • I don't think I can explain the differences very well, but [here's a link](https://www.r-bloggers.com/parallel-r-socket-or-fork/) (in addition to the one in my answer). I only meant that you know these functions apply a function over a list, not that you know how the parallel processing itself works. That said, stopCluster shuts down the worker processes once you're done, which frees the CPU cores and memory they were using. – John Sep 18 '20 at 18:28
  • Gotcha, thanks for the link too! Right, when I run my function it also takes up a lot of my CPU, so I'll look into that. Thanks again – Michael Sep 19 '20 at 06:56