
I'm trying to fit a regression in R in parallel, using the snowfall library (but I'm open to any approach). Currently I'm running the following regression, which is taking an extremely long time. Can someone show me how to do this?

 sales_day_region_ctgry_lm <- lm(log(sales_out+1)~factor(region_out) 
             + date_vector_out + factor(date_vector_out) +
             factor(category_out) + mean_temp_out)

I've started down the following path:

library(snowfall)
sfInit(parallel = TRUE, cpus=4, type="SOCK")

wrapper <- function() {
return(lm(log(sales_out+1)~factor(region_out) + date_vector_out +
               factor(date_vector_out) + factor(category_out) +   mean_temp_out))
}

output_lm <- sfLapply(*no idea what to do here*,wrapper)
sfStop()
summary(output_lm)

But this approach is riddled with errors.

Thanks!

nathanf
  • Doing this will get you the same model repeated 4 times, not the one model fitted in 1/4th the time. – Hong Ooi Mar 11 '16 at 05:22
  • If `lm` takes a long time that means your design matrix is huge, i.e., you have many factor levels. I'm also a bit skeptical if the transformation you are employing is the most appropriate way to go. Consider carefully if ordinary least squares regression is the best method to achieve whatever your goal is. – Roland Mar 11 '16 at 08:26
  • In particular, including a variable both as a continuous predictor and as a factor predictor seems ... let's call it *strange* .... – Roland Mar 11 '16 at 08:40
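To see the first comment's point concretely: `sfLapply` (like base R's `parLapply`, used in this hypothetical sketch since the `parallel` package ships with R) maps a function over a list, so it parallelises *many small* fits, not one big fit. A zero-argument wrapper gives every worker the identical job. The data and the per-group split below are made up for illustration:

```r
library(parallel)

## Hypothetical toy data standing in for the real sales table
set.seed(1)
d <- data.frame(
  y = rnorm(300),
  x = rnorm(300),
  g = sample(c("a", "b", "c"), 300, replace = TRUE)
)

cl <- makeCluster(2)

## One model per group: a list of jobs that *can* be split across workers.
## A wrapper that ignores its argument would instead refit the same model
## on every worker.
fits <- parLapply(cl, split(d, d$g), function(chunk) lm(y ~ x, data = chunk))

stopCluster(cl)
length(fits)  # one fitted model per group
```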

2 Answers


The partools package offers an easy, off-the-shelf implementation of parallelised linear regression via its calm() function. (The "ca" prefix stands for "chunk averaging".)

In your case -- leaving aside @Roland's correct comment about mixing up factor and continuous predictors -- the solution should be as simple as:

library(partools) ## also attaches parallel, so makeCluster() is available

cls <- makeCluster(4) ## Or, however many cores you want/have.

## The data frame must first be split across the workers, so that
## each node holds one chunk under the same name:
distribsplit(cls, "YOUR_DATA_HERE")

sales_day_region_ctgry_calm <-
  calm(
    cls,
    "log(sales_out+1) ~ factor(region_out) + date_vector_out +
     factor(date_vector_out) + factor(category_out) + mean_temp_out,
     data=YOUR_DATA_HERE"
  )

stopCluster(cls)

Note that the model arguments are passed as a single quoted string. Note further that you may need to randomise your data first if it is ordered in any way (e.g. by date). See the partools vignette for more details.
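To expand on the randomisation note: since calm() averages per-chunk fits, data sorted by e.g. date would hand each worker a systematically biased chunk. A minimal base-R shuffle before distributing, using a made-up data frame name as a stand-in for the real data:

```r
## Hypothetical toy stand-in for the real sales data
set.seed(42)
sales_df <- data.frame(sales_out = 1:10, mean_temp_out = rnorm(10))

## Shuffle the rows so each chunk sent to the workers is a random sample
sales_df <- sales_df[sample(nrow(sales_df)), , drop = FALSE]

## ...then distribute and fit as above, e.g.:
## distribsplit(cls, "sales_df")
```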

Grant

Since you're fitting one big model (as opposed to several small models), and you're using linear regression, a quick-and-easy way to get parallelism is to use a multithreaded BLAS. Something like Microsoft R Open (previously known as Revolution R Open) should do the trick.*

* disclosure: I work for Microsoft/Revolution.
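A quick way to check whether a multithreaded BLAS is actually kicking in, using base R only: lm() spends most of its time in dense linear algebra (QR decompositions and cross-products), which is exactly what an optimised BLAS accelerates, so timing a large matrix product is a rough but telling benchmark:

```r
## lm() bottlenecks in dense linear algebra, which a multithreaded
## BLAS (OpenBLAS, MKL, etc.) runs across several cores; the
## reference BLAS shipped with stock R runs single-threaded.
n <- 1000L
X <- matrix(rnorm(n * n), n, n)

## Watch your CPU monitor while this runs: multiple busy cores
## indicate a multithreaded BLAS is in use.
elapsed <- system.time(XtX <- crossprod(X))["elapsed"]

## Recent R versions also report the linked BLAS/LAPACK libraries here:
sessionInfo()
```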

Hong Ooi