for-loop linear regression generation new dataframe with the results

Question

I want to write a loop on R to perform linear regression on my dataset genes (= 210011 genes and 6 samples total; with columns the gene and the rows the samples) to identify how age and sex affect gene expression. I want to save the fitted value output from the linear regression in a new dataframe (generating basically a similar dataframe where on the columns there are the genes and in the rows the samples).

so the loop I wrote is:

genelist <- df %>% select(5:21011) #select only genes
for (i in 1:length(genelist)) {
  formula <- as.formula(paste0(genelist[i], ' ~ age + sex'))
  model <- lm(formula, data = df)
  print(model$fitted.values)
}

but I am not able to save a new dataframe. I tried to follow this How to Loop/Repeat a Linear Regression in R

test <- list(); model <- list()
for (i in 1:length(genelist)) {
  formula[[i]] = paste0(genelist[i], ' ~ age + sex')
  model[[i]] = lm(formula[[i]], data = df) 
}

but it gives me "list of 0" as results so I must have written something wrong. How can I modif my original code generating a new dataframe with the results?

thanks for help help!

score 2 · Answer 1 · answered Mar 06 '21 at 13:36

Create a vector of column names instead of column values. Try the following :

names_vec <- names(genelist)
formula <- vector('list', length(names_vec))
model <- vector('list', length(names_vec))

for (i in seq_along(names_vec)) {
  formula[[i]] = paste0(genelist[i], ' ~ age + sex')
  model[[i]] = lm(formula[[i]], data = df) 
}

DaveArmstrong · Accepted Answer · 2022-10-18T10:28:02.637

Here's an example that works:

library(dplyr)
set.seed(2053)
df <- data.frame(age = sample(18:80, 6, replace=FALSE), 
                 sex = sample(0:1, 6, replace=TRUE))
for(i in 1:10){
  df[[paste0("gene_", i)]] <- runif(6,0,1)
}

genelist <- df %>% select(3:12) #select only genes
pred <- df %>% select(age, sex)
sigs <- NULL
for (i in 1:length(genelist)) {
  formula <- reformulate(c("age", "sex"), response=names(genelist)[i])  
  model <- lm(formula, data = df)
  pred[[names(genelist)[i]]] <- predict(model, newdata=pred)
  pvals <- summary(model)$coefficients[-1,4]
  pvals <- c(pvals, "F" = unname(pf(summary(model)$fstatistic[1], 
                             summary(model)$fstatistic[2], 
                             summary(model)$fstatistic[3], 
                             lower.tail=FALSE)))
  sigs <- rbind(sigs, pvals)
}
rownames(sigs) <- colnames(genelist)

pred
#>   age sex    gene_1    gene_2      gene_3    gene_4      gene_5    gene_6
#> 1  54   0 0.6460394 0.7975062 0.542963150 0.5766314  0.43716321 0.3731399
#> 2  65   0 0.4969311 0.7557411 0.499976012 0.7201710 -0.02954846 0.3392473
#> 3  49   0 0.7138160 0.8164903 0.562502758 0.5113862  0.64930488 0.3885457
#> 4  62   0 0.5375970 0.7671316 0.511699777 0.6810238  0.09773654 0.3484907
#> 5  44   0 0.7815925 0.8354744 0.582042366 0.4461409  0.86144655 0.4039515
#> 6  40   1 0.3976764 0.3673542 0.009429805 0.2500409  0.38185899 0.5017752
#>      gene_7    gene_8     gene_9   gene_10
#> 1 0.6990817 0.6336038 0.36330413 0.3146205
#> 2 0.6414371 0.8336259 0.58575121 0.2651734
#> 3 0.7252838 0.5426847 0.26219181 0.3370964
#> 4 0.6571584 0.7790744 0.52508383 0.2786590
#> 5 0.7514859 0.4517656 0.16107950 0.3595723
#> 6 0.1903702 0.9501972 0.09472406 0.6118369

sigs
#>                  age         sex          F
#> gene_1  0.5736844190 0.462726187 0.72223654
#> gene_2  0.6526877593 0.079265450 0.12862783
#> gene_3  0.8493679497 0.289034889 0.44026028
#> gene_4  0.6573230145 0.837650010 0.73697690
#> gene_5  0.0004498121 0.001752855 0.00105546
#> gene_6  0.8812687531 0.864282218 0.92669812
#> gene_7  0.7960359480 0.286213514 0.45291886
#> gene_8  0.2845571231 0.190629747 0.35754969
#> gene_9  0.1456621812 0.957422081 0.19649049
#> gene_10 0.7587574987 0.521719609 0.54628975

^{Created on 2022-10-18 by the reprex package (v2.0.1)}

In the example above, pred is the answer to the original question. In the comments, you asked about identifying whether the models were significant. The object sigs captures the p-values from both the age and sex variables as well as the p-value for the model's omnibus F-statistic.

naive question in your solution you are generating the model in ` model <- lm(formula, data = df)` and then you are using that to predict? but is there a way to know if the first linear model which are significant or not? — PesKchan, Oct 17 '22 at 19:43
What do you mean by significant? A significant omnibus F statistic for the model? A significant coefficient on one of the variables? — DaveArmstrong, Oct 17 '22 at 23:16
yes if age or sex or both age:sex are good predictors ..in that context i was asking — PesKchan, Oct 18 '22 at 06:43
I added to the example a way to capture the p-values for the model coefficients and the F-statistic. — DaveArmstrong, Oct 18 '22 at 10:28
thank you very much this i will use as template for my analysis — PesKchan, Oct 24 '22 at 22:08

for-loop linear regression generation new dataframe with the results

2 Answers2