10

I'm dealing with a problem that requires parallel computing for getting faster results than with a classic "for loop".

Here's the problem:

I need to generate linear models for 198135 outcome variables contained in dataframes within a list object. I have to store all beta and p values for each predictive variable in the models in a dataframe, and also their goodness-of-fit measures.

I wrote a functional "for loop" that does the task properly, but it takes more than 35 hours for finishing it. I know that R is using less than 20% of my 8-core CPU, and I would like to use it all. The problem is that I don't know how to transform my for loop in a foreach loop for taking advantage of parallel computing.

Here's some example code of my problem in a smaller scale:

library(tidyverse)
library(broom)

## Example data 

outcome_list <- list(as.data.frame(cbind(rnorm(32), dataframe_id = c(1))),
                     as.data.frame(cbind(rnorm(32), dataframe_id =  c(2))),
                     as.data.frame(cbind(rnorm(32), dataframe_id =  c(3)))) ## This represents my list of 198135 dataframes

mtcars <- mtcars #I will use the explanatory variables from here



## Below this line is my current solution with a for loop that works fine

x <- list()
results_df <- as.data.frame(cbind(dataframe_id = c(0), intercept = c(0),
                                b_mpg = c(0), p_mpg = c(0),
                                b_cyl = c(0), p_cyl = c(0),
                                p.model = c(0), AIC = c(0),
                                BIC = c(0)))

for(i in 1:3){
  x[[i]] <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  gof <- broom::glance(x[[i]])
  betas <- broom::tidy(x[[i]])
  results_df <- rbind(results_df, c(outcome_list[[i]]$V2[1], 
                                    betas$estimate[1],
                                    betas$estimate[2], betas$p.value[2], 
                                    betas$estimate[3], betas$p.value[3],
                                    gof$p.value, gof$r.squared, gof$AIC,
                                    gof$BIC))

  if(i %% i == 0){
    message(paste(i, "of 3")) # To know if my machine has not crashed
    x <- list() # To keep RAM clean of useless data
  }
  gc()
}

results_df <- results_df[-1, ]



With the code shown above I get the results that I need (a dataframe with regression parameters and goodness of fit for each outcome variable from the list), but it is very slow because I'm not able to use all of my computer power.

I know that using "foreach" and "doParallel" packages I can solve this problem in a faster way, but I still don't understand the logic behind foreach loops structure, since it's the first time I need to process so many data.

PS: I've already tried several ways with foreach function but I didn't get anywhere. I didn't write my foreach atempts of solutions because I'm not understanding what I'm doing.

crlagos0
  • 169
  • 1
  • 2
  • 7
  • 1
    Part of the reason your code is so slow is that you add new elements onto `x` and `results_df` with each iteration. This means R has to keep reallocating memory as they grow. Only output the new row with each loop iteration. – Frank Apr 04 '19 at 02:11

1 Answers1

6

You can do:

## Example data 
outcome_list <- list(as.data.frame(cbind(rnorm(32), dataframe_id = c(1))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(2))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(3))))

## Parallel code
library(doParallel)
registerDoParallel(cl <- makeCluster(3))
results_list <- foreach(i = 1:3) %dopar% {

  mylm <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  gof <- broom::glance(mylm)
  betas <- broom::tidy(mylm)

  c(outcome_list[[i]]$V2[1], 
    betas$estimate[1],
    betas$estimate[2], betas$p.value[2], 
    betas$estimate[3], betas$p.value[3],
    gof$p.value, gof$r.squared, gof$AIC,
    gof$BIC)
}
stopCluster(cl)

results_df <- setNames(as.data.frame(do.call("rbind", results_list)),
                       c("dataframe_id", "intercept", "b_mpg", "p_mpg", 
                         "b_disp", "p_disp", "p.model", "AIC", "BIC"))

Your return your results in foreach (that works like lapply) instead of growing an object (which is not possible in parallel BTW).

Learn more on how to use foreach there.

F. Privé
  • 11,423
  • 2
  • 27
  • 78