How to loop thousands of linear regressions and extract the P-values

Question

I have one continuous variable with 76 observations and 29877 other continuous variables that I would like to run linear regressions against. The code for a single regression might look something like:

lm(data=spread_gene_exp, spread_gene_exp[,i] ~ spread_gene_exp[,2])

where i is one of the 29877 Y variables.

I would then like to extract the coefficient estimate and P-value from each result and add them to a data frame where I can easily identify which variables were the most significant. This would probably need some kind of loop, but I'm not sure where to start.
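One possible shape for such a loop is sketched below. It is a minimal example rather than a definitive solution: the data are simulated as a stand-in for `spread_gene_exp`, and the column layout (column 1 an id, column 2 the predictor, columns 3 onward the Y variables), the `gene1`, `gene2`, ... names, and the helper `fit_one` are all assumptions made for illustration.

```r
# Simulated stand-in for spread_gene_exp (hypothetical data, not the real set):
# column 1 = sample id, column 2 = predictor, columns 3+ = Y variables.
set.seed(1)
n_obs <- 76
n_genes <- 200  # kept small so the sketch runs quickly; the real data has 29877
spread_gene_exp <- data.frame(
  sample = seq_len(n_obs),
  predictor = rnorm(n_obs),
  matrix(rnorm(n_obs * n_genes), nrow = n_obs,
         dimnames = list(NULL, paste0("gene", seq_len(n_genes))))
)

x <- spread_gene_exp[[2]]
y_cols <- 3:ncol(spread_gene_exp)

# Fit one regression per Y column and keep the slope row of the coefficient
# table (row 2; row 1 is the intercept).
fit_one <- function(i) {
  fit <- lm(spread_gene_exp[[i]] ~ x)
  coef(summary(fit))[2, c("Estimate", "Pr(>|t|)")]
}
stats <- t(vapply(y_cols, fit_one, numeric(2)))

results <- data.frame(
  variable = names(spread_gene_exp)[y_cols],
  estimate = stats[, 1],
  p_value = stats[, 2]
)

# Smallest P-values first.
results <- results[order(results$p_value), ]
head(results)
```

Using `vapply` instead of growing a data frame inside a `for` loop avoids repeated copying, and the final `order()` call puts the most significant variables at the top.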

I would love to run this on my own desktop, but it will most likely take quite some time, so I will probably have to run it on the university server. Any estimate of how long it would take on a midrange/semi-powerful desktop would be helpful.
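One way to get a rough idea of the run time is to time a subset and extrapolate. The sketch below does that on simulated data (an assumption, as are the names `Y`, `x`, and `slopes`), using the fact that `lm()` accepts a matrix response and fits every column against the same predictor in one call. The linear extrapolation is only a ballpark figure, and actual timings depend on the machine.

```r
# Simulated data (hypothetical): 76 observations, 500 response columns.
set.seed(1)
n_obs <- 76
n_genes <- 500
x <- rnorm(n_obs)
Y <- matrix(rnorm(n_obs * n_genes), nrow = n_obs,
            dimnames = list(NULL, paste0("gene", seq_len(n_genes))))

timing <- system.time({
  fit <- lm(Y ~ x)          # matrix response: one fit per column of Y
  per_gene <- summary(fit)  # list with one summary per response
  # Keep the slope estimate and its P-value for each response.
  slopes <- t(vapply(per_gene,
                     function(s) coef(s)[2, c("Estimate", "Pr(>|t|)")],
                     numeric(2)))
})

# Rough linear extrapolation from 500 responses to 29877.
est_seconds <- timing[["elapsed"]] * 29877 / n_genes
est_seconds
```

If the extrapolated figure is small, the full set may well be feasible on a desktop; timing a larger subset of the real data would give a more trustworthy number.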

Related: https://stackoverflow.com/questions/27952653/how-to-loop-repeat-a-linear-regression-in-r — MrFlick, Feb 18 '19 at 22:59
Another possible duplicate: https://stackoverflow.com/questions/25036007/linear-regression-loop-for-each-independent-variable-individually-against-depend — MrFlick, Feb 18 '19 at 23:00
From a Bioinformatics point of view: Why reinvent the wheel? If this is about characterising differential gene expression there exist much more robust methods like `DESeq2`, `limma`, `edgeR` (and many more). — Maurits Evers, Feb 18 '19 at 23:01
I'm a student bioinformatician and wasn't aware of these packages, thanks for the help — chris wills, Feb 18 '19 at 23:08
No worries @chriswills; I recommend taking a look at the `limma` and `DESeq2` vignettes on Bioconductor; they give a lot of detail on the underlying statistical models. Many of these methods date back to microarray times and were developed to robustly quantify changes in gene expression, e.g. by first estimating the mean-variance relationship from the expression of all genes (the data are usually not homoskedastic). Anyway, good luck with your studies! — Maurits Evers, Feb 18 '19 at 23:17
You might be better off using [stepwise regression](https://stats.stackexchange.com/questions/214682/stepwise-regression-in-r-how-does-it-work) to find the most relevant variables with linear regression or other measures such as variable importance for tree-based models such as random forests. — Lorenz Walthert, Feb 18 '19 at 23:58
