-1

I want to write a loop on R to perform linear regression on my dataset genes (= 210011 genes and 6 samples total; with columns the gene and the rows the samples) to identify how age and sex affect gene expression. I want to save the fitted value output from the linear regression in a new dataframe (generating basically a similar dataframe where on the columns there are the genes and in the rows the samples).

so the loop I wrote is:

genelist <- df %>% select(5:21011) #select only genes
for (i in 1:length(genelist)) {
  formula <- as.formula(paste0(genelist[i], ' ~ age + sex'))
  model <- lm(formula, data = df)
  print(model$fitted.values)
}

but I am not able to save a new dataframe. I tried to follow this How to Loop/Repeat a Linear Regression in R

test <- list(); model <- list()
for (i in 1:length(genelist)) {
  formula[[i]] = paste0(genelist[i], ' ~ age + sex')
  model[[i]] = lm(formula[[i]], data = df) 
}

but it gives me "list of 0" as results so I must have written something wrong. How can I modif my original code generating a new dataframe with the results?

thanks for help help!

cb_1990
  • 3
  • 1

2 Answers2

2

Create a vector of column names instead of column values. Try the following :

names_vec <- names(genelist)
formula <- vector('list', length(names_vec))
model <- vector('list', length(names_vec))

for (i in seq_along(names_vec)) {
  formula[[i]] = paste0(genelist[i], ' ~ age + sex')
  model[[i]] = lm(formula[[i]], data = df) 
}
Ronak Shah
  • 377,200
  • 20
  • 156
  • 213
2

Here's an example that works:

library(dplyr)
set.seed(2053)
df <- data.frame(age = sample(18:80, 6, replace=FALSE), 
                 sex = sample(0:1, 6, replace=TRUE))
for(i in 1:10){
  df[[paste0("gene_", i)]] <- runif(6,0,1)
}

genelist <- df %>% select(3:12) #select only genes
pred <- df %>% select(age, sex)
sigs <- NULL
for (i in 1:length(genelist)) {
  formula <- reformulate(c("age", "sex"), response=names(genelist)[i])  
  model <- lm(formula, data = df)
  pred[[names(genelist)[i]]] <- predict(model, newdata=pred)
  pvals <- summary(model)$coefficients[-1,4]
  pvals <- c(pvals, "F" = unname(pf(summary(model)$fstatistic[1], 
                             summary(model)$fstatistic[2], 
                             summary(model)$fstatistic[3], 
                             lower.tail=FALSE)))
  sigs <- rbind(sigs, pvals)
}
rownames(sigs) <- colnames(genelist)

pred
#>   age sex    gene_1    gene_2      gene_3    gene_4      gene_5    gene_6
#> 1  54   0 0.6460394 0.7975062 0.542963150 0.5766314  0.43716321 0.3731399
#> 2  65   0 0.4969311 0.7557411 0.499976012 0.7201710 -0.02954846 0.3392473
#> 3  49   0 0.7138160 0.8164903 0.562502758 0.5113862  0.64930488 0.3885457
#> 4  62   0 0.5375970 0.7671316 0.511699777 0.6810238  0.09773654 0.3484907
#> 5  44   0 0.7815925 0.8354744 0.582042366 0.4461409  0.86144655 0.4039515
#> 6  40   1 0.3976764 0.3673542 0.009429805 0.2500409  0.38185899 0.5017752
#>      gene_7    gene_8     gene_9   gene_10
#> 1 0.6990817 0.6336038 0.36330413 0.3146205
#> 2 0.6414371 0.8336259 0.58575121 0.2651734
#> 3 0.7252838 0.5426847 0.26219181 0.3370964
#> 4 0.6571584 0.7790744 0.52508383 0.2786590
#> 5 0.7514859 0.4517656 0.16107950 0.3595723
#> 6 0.1903702 0.9501972 0.09472406 0.6118369

sigs
#>                  age         sex          F
#> gene_1  0.5736844190 0.462726187 0.72223654
#> gene_2  0.6526877593 0.079265450 0.12862783
#> gene_3  0.8493679497 0.289034889 0.44026028
#> gene_4  0.6573230145 0.837650010 0.73697690
#> gene_5  0.0004498121 0.001752855 0.00105546
#> gene_6  0.8812687531 0.864282218 0.92669812
#> gene_7  0.7960359480 0.286213514 0.45291886
#> gene_8  0.2845571231 0.190629747 0.35754969
#> gene_9  0.1456621812 0.957422081 0.19649049
#> gene_10 0.7587574987 0.521719609 0.54628975

Created on 2022-10-18 by the reprex package (v2.0.1)

In the example above, pred is the answer to the original question. In the comments, you asked about identifying whether the models were significant. The object sigs captures the p-values from both the age and sex variables as well as the p-value for the model's omnibus F-statistic.

DaveArmstrong
  • 18,377
  • 2
  • 13
  • 25
  • naive question in your solution you are generating the model in ` model <- lm(formula, data = df)` and then you are using that to predict? but is there a way to know if the first linear model which are significant or not? – PesKchan Oct 17 '22 at 19:43
  • What do you mean by significant? A significant omnibus F statistic for the model? A significant coefficient on one of the variables? – DaveArmstrong Oct 17 '22 at 23:16
  • yes if age or sex or both age:sex are good predictors ..in that context i was asking – PesKchan Oct 18 '22 at 06:43
  • I added to the example a way to capture the p-values for the model coefficients and the F-statistic. – DaveArmstrong Oct 18 '22 at 10:28
  • thank you very much this i will use as template for my analysis – PesKchan Oct 24 '22 at 22:08