I want to write a loop in R to run multiple regressions with one dependent variables and two lists of independent variables (all continuous variables). The model is additive and the loop should run by iterating through the two lists of variables so that it takes the first column from the first list + the first column from the second list, then the same for the second column in the two lists etc. The problem is I can't get it to iterate through the lists properly, instead my loop runs more models than it should.
The dataframe I am describing here is just a subset I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).
My dataframe is called dry, and contains 22 variables (columns) and 87 observations (rows). Column 1 contains genotypeIDs, column 2:11 contains one set of independent variables to iterate through, column 12:21 contains a second set of independent variables to iterate through, and column 23 contains my dependent variable called FITNESS_DRY. This is what the structure looks like:
str(dry)
'data.frame': 87 obs. of 22 variables:
$ geneID : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25 30 35 36 38 39 ...
$ RDPI_T1 : num 1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2 : num -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3 : num -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4 : num 0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5 : num -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6 : num 2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7 : num -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8 : num -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9 : num 0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10 : num 1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1 : num 0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2 : num 1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3 : num -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4 : num 0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5 : num 0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6 : num -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7 : num -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8 : num 0.396 NA NA -0.146 -0.468 ...
$ DRY_T9 : num -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10 : num -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num 1.301 3.365 0.458 0.346 1.983 ...
The goal is to run 10 multiple regressions looking like this:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
and so forth iterating through all ten columns for both lists This is equivalent to the following in terms of indexing
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
etc.
My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.
I first defined my variable lists
var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)
This is the loop I tried which doesn't work properly:
lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}
The loop returns this error message:
Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)
I can call up the object "lm.test1", but that object has 27 lines instead of the 10 I want, so the iterations are not working properly here. Can anyone help with this please? Also, it would be great if I could add the names of my columns for each list of variables into the summary. I have tried using this for each variable list but without succes:
name<-append(name, as.character(colnames(var_list$var1))
Any ideas? Thanks in advance for any help!
UPDATE1: More information on the full data set: My actual data will still contain a first colum "geneID", then I have 3772 columns names DRY_T1....DRY_T3772, then another 3772 columns names RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:
lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)
I simulated a dataset as such:
set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))
I then run the for loop:
models = list()
for(i in 1:3772) {
vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
models2[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse="
+")),
data = dat3)
}
This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:
Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) :
0 (non-NA) cases
UPDATE 2: Indeed the model had issues handling numbers with two or more digits. To see how things go wrong in the original version I used this: (my dataset is called "dry2"):
names(dry2)[grepl("2$", names(dry2))]
This returned all DRY_T and RDPI_T variables with numbers containing "2" instead of just one pair of DRY_T and RDPI_T.
To fix the issue this new code works:
models = list()
for(i in 1:3772) {
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
data = dry2)
}