I'm dealing with a problem that requires parallel computing to get results faster than a classic for loop.
Here's the problem:
I need to fit a linear model for each of 198,135 outcome variables, stored as data frames inside a list. For every model, I have to collect the beta and p-value of each predictor, together with the model's goodness-of-fit measures, into a single data frame.
I wrote a working for loop that does the task correctly, but it takes more than 35 hours to finish. R uses less than 20% of my 8-core CPU, and I would like to use all of it. The problem is that I don't know how to turn my for loop into a foreach loop to take advantage of parallel computing.
Here's some example code that reproduces my problem on a smaller scale:
library(tidyverse)
library(broom)
## Example data
outcome_list <- list(as.data.frame(cbind(rnorm(32), dataframe_id = c(1))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(2))),
                     as.data.frame(cbind(rnorm(32), dataframe_id = c(3)))) ## This represents my list of 198135 dataframes
mtcars <- mtcars # I will use the explanatory variables from here
## Below this line is my current solution with a for loop that works fine
x <- list()
results_df <- as.data.frame(cbind(dataframe_id = c(0), intercept = c(0),
                                  b_mpg = c(0), p_mpg = c(0),
                                  b_cyl = c(0), p_cyl = c(0),
                                  p.model = c(0), r.squared = c(0),
                                  AIC = c(0), BIC = c(0)))
for (i in 1:3) {
  x[[i]] <- lm(outcome_list[[i]]$V1 ~ mtcars$mpg + mtcars$cyl)
  gof <- broom::glance(x[[i]])
  betas <- broom::tidy(x[[i]])
  results_df <- rbind(results_df, c(outcome_list[[i]]$dataframe_id[1],
                                    betas$estimate[1],
                                    betas$estimate[2], betas$p.value[2],
                                    betas$estimate[3], betas$p.value[3],
                                    gof$p.value, gof$r.squared, gof$AIC,
                                    gof$BIC))
  message(paste(i, "of 3")) # To know that my machine has not crashed
  x <- list()               # To keep RAM clean of useless data
  gc()
}
results_df <- results_df[-1, ]
With the code shown above I get the results I need (a data frame with the regression parameters and goodness-of-fit measures for each outcome variable in the list), but it is very slow because it uses only a fraction of my machine's computing power.
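I also suspect that growing results_df with rbind on every iteration makes things worse, since the whole data frame is copied each time. Below is a sketch of a serial refactor I've considered (fit_one is a helper name I made up, not something from my real code): it builds one-row data frames in a list and binds them all in a single step at the end.

## Hypothetical helper: fit one outcome data frame, return a one-row data.frame
fit_one <- function(df) {
  fit   <- lm(df$V1 ~ mtcars$mpg + mtcars$cyl)
  gof   <- broom::glance(fit)
  betas <- broom::tidy(fit)
  data.frame(dataframe_id = df$dataframe_id[1],
             intercept = betas$estimate[1],
             b_mpg = betas$estimate[2], p_mpg = betas$p.value[2],
             b_cyl = betas$estimate[3], p_cyl = betas$p.value[3],
             p.model = gof$p.value, r.squared = gof$r.squared,
             AIC = gof$AIC, BIC = gof$BIC)
}
## Fit every outcome, then bind all rows once instead of growing results_df
results_df <- do.call(rbind, lapply(outcome_list, fit_one))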
I know that the foreach and doParallel packages can solve this problem faster, but I still don't understand the logic behind the structure of foreach loops, since this is the first time I've needed to process this much data.
PS: I've already tried several approaches with the foreach function, but I didn't get anywhere. I haven't included those attempts because I don't really understand what I'm doing; the sketch below shows roughly the structure I've been aiming for.
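For reference, here is a minimal sketch of what I understand the foreach/doParallel structure should look like, assuming a PSOCK cluster from parallel::makeCluster and reusing the hypothetical fit_one helper from above; I can't tell whether this is the right structure, which is exactly what I'd like to understand:

library(foreach)
library(doParallel)

cl <- parallel::makeCluster(parallel::detectCores() - 1) # leave one core free
registerDoParallel(cl)

results_df <- foreach(df = outcome_list,      # iterate over the list elements
                      .combine = rbind,       # stack the one-row data frames
                      .packages = "broom") %dopar% {
  fit_one(df) # foreach should auto-export fit_one to the workers
}

stopCluster(cl)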