
I created a function using ```sapply()``` to run 153k linear regressions and extract the estimate, SE, and p-value for each, along with the associated probe ID from the first column. The code should be self-explanatory:

library(dplyr)  # for bind_cols()

run_lms <- sapply(1:nrow(test_wide[, -1]), function(x) {
  lm_output <- lm(
    unlist(test_wide[x, -1]) ~ survey_clean_for_lm$CR + survey_clean_for_lm$cbage +
      survey_clean_for_lm$sex + survey_clean_for_lm$bmistrat + survey_clean_for_lm$deidsite +
      survey_clean_for_lm$snppc1 + survey_clean_for_lm$snppc2 + survey_clean_for_lm$snppc3 +
      survey_clean_for_lm$methpc1 + survey_clean_for_lm$methpc2 + survey_clean_for_lm$methpc3 +
      survey_clean_for_lm$methpc4 + survey_clean_for_lm$methpc5 + survey_clean_for_lm$methpc6 +
      survey_clean_for_lm$methpc7
  )
  lm_summary <- summary(lm_output)
  estimate <- lm_summary$coefficients[2, 1]  # coefficient on CR
  se       <- lm_summary$coefficients[2, 2]
  pval     <- lm_summary$coefficients[2, 4]
  bind_cols(test_wide[x, 1], estimate, se, pval)
})

It takes nearly 10 hours to run 153k regressions and store the output. I'm wondering if anyone has advice for speeding this up. I think the bind_cols() portion is part of the problem, but I'm not sure how else to structure and save the output. Ultimately, I want this format:

   probe      estimate       se      pval
   <chr>         <dbl>    <dbl>     <dbl>
 1 cg20272595  0.00556 0.00135  0.0000600
 2 cg13995374  0.00466 0.00114  0.0000654
 3 cg05254132  0.00367 0.000897 0.0000658
 4 cg10049251 -0.00727 0.00179  0.0000746
 5 cg19695507 -0.0108  0.00274  0.000117 
 6 cg21590616  0.00687 0.00176  0.000136 
 7 cg04089674 -0.00718 0.00186  0.000158 
 8 cg16907093 -0.00506 0.00132  0.000184 
 9 cg04600792 -0.00593 0.00156  0.000193 
10 cg10529757  0.0122  0.00322  0.000199 
# … with 151,853 more rows
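
One way to avoid calling ```bind_cols()``` on every iteration is to return a bare numeric vector from each fit and assemble a single tibble once at the end. A minimal sketch, untested, assuming the covariates are all columns of `survey_clean_for_lm` and the first column of `test_wide` holds the probe IDs (as in the output above); the fits themselves are unchanged, so the statistics are identical:

```r
library(tibble)

covars <- survey_clean_for_lm  # shorter alias, purely for readability

# fit_one() is a hypothetical helper: fit one probe, return estimate, SE, p-value for CR
fit_one <- function(x) {
  fit <- lm(unlist(test_wide[x, -1]) ~ CR + cbage + sex + bmistrat + deidsite +
              snppc1 + snppc2 + snppc3 +
              methpc1 + methpc2 + methpc3 + methpc4 +
              methpc5 + methpc6 + methpc7,
            data = covars)
  summary(fit)$coefficients[2, c(1, 2, 4)]  # estimate, se, pval for the CR term
}

# 3 x n_probes matrix of results, no per-iteration tibble construction
res <- vapply(seq_len(nrow(test_wide)), fit_one, numeric(3))

results <- tibble(
  probe    = test_wide[[1]],
  estimate = res[1, ],
  se       = res[2, ],
  pval     = res[3, ]
)
```

Row-subsetting a wide tibble and building a one-row tibble with ```bind_cols()``` inside the loop both add overhead; returning a plain numeric vector keeps each iteration cheap and makes the final assembly a single step.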
Calen
    I think you would get more help if you read about [creating a reproducible example](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example) and adjusted your question accordingly with an example data set and runnable code. This would allow for concrete suggestions that can be benchmarked. – LMc May 11 '22 at 20:11
  • As it stands now, since you are familiar with apply, I would recommend you look into [furrr](https://furrr.futureverse.org/) and leverage parallel processing to help improve the time it takes to run your code (a minimal sketch appears after these comments). – LMc May 11 '22 at 20:22
  • Thanks for your reply @LMc! I'm honestly not sure how to do that - ```dput()``` for 3 datasets would produce an obscenely long post even for a subset of the data. In response to your second point - I'm using AWS and it seems to be running in parallel (i.e. all 8 cores are working), but the more cores I add the less (proportionally) each one uses. Not sure why. Thought folks might have simple suggestions given the simplicity of my ```sapply()``` function. – Calen May 11 '22 at 23:22
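
A minimal sketch of the furrr route suggested in the comments above, reusing the hypothetical ```fit_one()``` helper from the earlier sketch; by default furrr should detect and export the required globals (`test_wide`, `covars`) to the workers:

```r
library(furrr)  # also attaches future, which provides plan()

plan(multisession, workers = 8)  # one background R process per core

# parallel version of the vapply() loop; each element is c(estimate, se, pval)
res_par <- future_map(seq_len(nrow(test_wide)), fit_one)

res <- do.call(rbind, res_par)   # n_probes x 3 matrix

results <- tibble::tibble(
  probe    = test_wide[[1]],
  estimate = res[, 1],
  se       = res[, 2],
  pval     = res[, 3]
)

plan(sequential)  # release the workers
```

Note that with `multisession` each worker receives its own copy of the inputs, so memory use and the initial data transfer grow with the number of workers, which may be part of why adding cores gives diminishing returns.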

0 Answers