I wrote a function using sapply to run 153k linear regressions and, for each one, extract the estimate, standard error, and p-value, along with the associated row name from the first column. The code should be self-explanatory:
run_lms <- sapply(1:nrow(test_wide[, -1]), function(x) {
  # Regress probe x (row x of test_wide, minus the ID column) on CR plus covariates
  lm_output <- lm(
    unlist(test_wide[x, -1]) ~ survey_clean_for_lm$CR + survey_clean_for_lm$cbage +
      survey_clean_for_lm$sex + survey_clean_for_lm$bmistrat + survey_clean_for_lm$deidsite +
      survey_clean_for_lm$snppc1 + survey_clean_for_lm$snppc2 + survey_clean_for_lm$snppc3 +
      survey_clean_for_lm$methpc1 + survey_clean_for_lm$methpc2 + survey_clean_for_lm$methpc3 +
      survey_clean_for_lm$methpc4 + survey_clean_for_lm$methpc5 + survey_clean_for_lm$methpc6 +
      survey_clean_for_lm$methpc7
  )
  lm_summary <- summary(lm_output)
  # Row 2 of the coefficient table is the first term after the intercept (CR)
  estimate <- lm_summary$coefficients[2, 1]
  se <- lm_summary$coefficients[2, 2]
  pval <- lm_summary$coefficients[2, 4]
  bind_cols(test_wide[x, 1], estimate, se, pval)
})
It takes nearly 10 hours to run the 153k regressions and store the output. Does anyone have advice for speeding this up? I think the bind_cols() call is part of the problem, but I'm not sure how else to structure and save the output. Ultimately, I want this format:
   probe      estimate       se      pval
   <chr>         <dbl>    <dbl>     <dbl>
 1 cg20272595  0.00556 0.00135  0.0000600
 2 cg13995374  0.00466 0.00114  0.0000654
 3 cg05254132  0.00367 0.000897 0.0000658
 4 cg10049251 -0.00727 0.00179  0.0000746
 5 cg19695507 -0.0108  0.00274  0.000117
 6 cg21590616  0.00687 0.00176  0.000136
 7 cg04089674 -0.00718 0.00186  0.000158
 8 cg16907093 -0.00506 0.00132  0.000184
 9 cg04600792 -0.00593 0.00156  0.000193
10 cg10529757  0.0122  0.00322  0.000199
# … with 151,853 more rows
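One direction that might be worth exploring, sketched here under assumptions rather than as a tested fix for the real data: every regression above uses the same right-hand side, so the design matrix could be decomposed once and reused for all probes instead of calling lm() 153k times. The sketch below uses simulated stand-ins (the covars data frame, the sizes n and m, and the three-covariate formula are all hypothetical, not the real survey_clean_for_lm or test_wide), and it assumes no missing values and a full-rank design matrix.

```r
set.seed(1)
n <- 50   # hypothetical number of samples
m <- 200  # hypothetical number of probes

# Hypothetical stand-ins for survey_clean_for_lm and test_wide
covars <- data.frame(
  CR    = rnorm(n),
  cbage = rnorm(n),
  sex   = factor(sample(c("F", "M"), n, replace = TRUE))
)
test_wide <- data.frame(
  probe = sprintf("cg%08d", seq_len(m)),
  matrix(rnorm(m * n), nrow = m)
)

# Shared design matrix (n x p) and response matrix (n x m, one column per probe)
X <- model.matrix(~ CR + cbage + sex, data = covars)
Y <- t(as.matrix(test_wide[, -1]))

# Decompose X once, then solve for all probes together
qrX     <- qr(X)
B       <- qr.coef(qrX, Y)                 # p x m coefficient matrix
res     <- qr.resid(qrX, Y)                # n x m residual matrix
df_res  <- n - qrX$rank                    # residual degrees of freedom
sigma2  <- colSums(res^2) / df_res         # residual variance per probe
XtX_inv <- chol2inv(qr.R(qrX))             # (X'X)^-1, assuming full rank

# Pull out the CR term for every probe
j    <- which(colnames(X) == "CR")
est  <- unname(B[j, ])
se   <- unname(sqrt(XtX_inv[j, j] * sigma2))
pval <- 2 * pt(abs(est / se), df_res, lower.tail = FALSE)

results <- data.frame(probe = test_wide[[1]], estimate = est, se = se, pval = pval)
head(results)
```

If some probes have missing values, lm() would drop those samples per probe, whereas this matrix version would not, so the two would no longer agree for those probes. Building the result once from whole vectors also avoids calling bind_cols() inside the loop.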