3

I want to write a loop in R to run multiple regressions with one dependent variables and two lists of independent variables (all continuous variables). The model is additive and the loop should run by iterating through the two lists of variables so that it takes the first column from the first list + the first column from the second list, then the same for the second column in the two lists etc. The problem is I can't get it to iterate through the lists properly, instead my loop runs more models than it should.

The dataframe I am describing here is just a subset I will actually have to run this 3772 times (I am working on RNA-seq transcript expression).

My dataframe is called dry, and contains 22 variables (columns) and 87 observations (rows). Column 1 contains genotypeIDs, column 2:11 contains one set of independent variables to iterate through, column 12:21 contains a second set of independent variables to iterate through, and column 23 contains my dependent variable called FITNESS_DRY. This is what the structure looks like:

str(dry)
'data.frame':   87 obs. of  22 variables:
$ geneID     : Factor w/ 87 levels "e10","e101","e102",..: 12 15 17 24 25    30 35 36 38 39 ...
$ RDPI_T1    : num  1.671 -0.983 -0.776 -0.345 0.313 ...
$ RDPI_T2    : num  -0.976 -0.774 -0.532 -1.137 1.602 ...
$ RDPI_T3    : num  -0.197 -0.324 0.805 -0.701 -0.566 ...
$ RDPI_T4    : num  0.289 -0.92 1.117 -1.214 -0.447 ...
$ RDPI_T5    : num  -0.671 1.963 NA -1.024 -0.295 ...
$ RDPI_T6    : num  2.606 -1.116 -0.383 -0.893 0.119 ...
$ RDPI_T7    : num  -0.843 -0.229 -0.297 0.504 -0.712 ...
$ RDPI_T8    : num  -0.227 NA NA -0.816 -0.761 ...
$ RDPI_T9    : num  0.754 -1.304 1.867 -0.514 -1.377 ...
$ RDPI_T10   : num  1.1352 -0.1028 -0.69 2.0242 -0.0925 ...
$ DRY_T1     : num  0.6636 -0.64508 -0.24643 -1.43231 -0.00855 ...
$ DRY_T2     : num  1.008 0.823 -0.658 -0.148 0.272 ...
$ DRY_T3     : num  -0.518 -0.357 1.294 0.408 0.771 ...
$ DRY_T4     : num  0.0723 0.2834 0.5198 1.6527 0.4259 ...
$ DRY_T5     : num  0.1831 1.9984 NA 0.0923 0.1232 ...
$ DRY_T6     : num  -1.55 0.366 0.692 0.902 -0.993 ...
$ DRY_T7     : num  -2.483 -0.334 -1.077 -1.537 0.393 ...
$ DRY_T8     : num  0.396 NA NA -0.146 -0.468 ...
$ DRY_T9     : num  -0.694 0.353 2.384 0.665 0.937 ...
$ DRY_T10    : num  -1.24 -1.57 -1.36 -3.88 -1.4 ...
$ FITNESS_DRY: num  1.301 3.365 0.458 0.346 1.983 ...

The goal is to run 10 multiple regressions looking like this:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)

and so forth iterating through all ten columns for both lists This is equivalent to the following in terms of indexing

lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])
lm1<-lm(FITNESS_DRY~dry[,12]+dry[,2])

etc.

My loop should then calculate summaries for each model, and combine all the pvalues (4th column of the lm summary) in an output object.

I first defined my variable lists

var_list<-list(
var1=dry[,12:21],
var2=dry[,2:11]
)

This is the loop I tried which doesn't work properly:

lm.test1<-name<-vector()
for (i in 12:length(var_list$var1)){
    for (j in 2:length(var_list$var2)){
lm.tmp<-lm(FITNESS_DRY~dry[,i]+dry[,j], na.action=na.omit, data=dry)
sum.tmp<-summary(lm.tmp)
lm.test1<-rbind(lm.test1,sum.tmp$coefficients[,4]) }
}

The loop returns this error message:

Warning message:
In rbind(lm.test6, sum.tmp$coefficients[, 4]) :
number of columns of result is not a multiple of vector length (arg 2)

I can call up the object "lm.test1", but that object has 27 lines instead of the 10 I want, so the iterations are not working properly here. Can anyone help with this please? Also, it would be great if I could add the names of my columns for each list of variables into the summary. I have tried using this for each variable list but without succes:

name<-append(name, as.character(colnames(var_list$var1))

Any ideas? Thanks in advance for any help!

UPDATE1: More information on the full data set: My actual data will still contain a first colum "geneID", then I have 3772 columns names DRY_T1....DRY_T3772, then another 3772 columns names RDPI_T1...RDPI_T3772, and finally my dependent variable "FITNESS_DRY". I still want to run all additive models as such:

lm1<-lm(FITNESS_DRY~DRY_T1+RDPI_T1)
lm2<-lm(FITNESS_DRY~DRY_T2+RDPI_T2)
lm3772<-lm(FITNESS_DRY~DRY_T3772+RDPI_T3772)

I simulated a dataset as such:

set.seed(2)
dat3 = as.data.frame(replicate(7544, runif(20)))
names(dat3) = paste0(rep(c("DRY_T","RDPI_T"),each=3772), 1:3772)
dat3 = cbind(dat3, FITNESS_DRY=runif(20))

I then run the for loop:

models = list()
for(i in 1:3772) {
vars = names(dat3)[grepl(paste0(i,"$"), names(dat3))]
models2[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" 
+")),
                                 data = dat3)
}

This works fine on the data simulation, but when I try it on my real dataset that is set up exactly in the same way it doesn't work. The loop is probably having issues handling numbers with two or more digits. I get this error message:

Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 
0 (non-NA) cases

UPDATE 2: Indeed the model had issues handling numbers with two or more digits. To see how things go wrong in the original version I used this: (my dataset is called "dry2"):

names(dry2)[grepl("2$", names(dry2))]

This returned all DRY_T and RDPI_T variables with numbers containing "2" instead of just one pair of DRY_T and RDPI_T.

To fix the issue this new code works:

models = list()

for(i in 1:3772) {
vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]
models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" +   ")),
data = dry2)
}
  • 3
    If you can provide a small reproducible example (see [here](https://stackoverflow.com/questions/5963269/how-to-make-a-great-r-reproducible-example)) I can code up something, but in any case, I would approach the problem by creating two data.frames, each holding columns either DRY_* or RDPI_*. I would then make an anonymous function that would take in lists dry and rdpi and a vector fitness. This function would run the regression and output a summary and do `mapply(myfun, dry, rdpi, fitness = fitness)`. – Roman Luštrik Jun 28 '19 at 21:33
  • Any way I can upload my dataset? – Elena Hamann Jun 28 '19 at 22:08
  • 1
    just edited my question to include > dput(head(dry2, 10)) – Elena Hamann Jun 28 '19 at 22:15

2 Answers2

1

There are a number of ways to set up the model formulas for iteration. Here's one approach, which we demonstrate using a for loop or map from the purrr package for the iteration. Then we use tidy from the broom package to get the coefficients and p-values.

library(tidyverse)
library(broom)

# Fake data
set.seed(2)
dat = as.data.frame(replicate(20, runif(20)))
names(dat) = paste0(rep(c("DRY_T","RDPI_T"),each=10), 0:9)
dat = cbind(dat, FITNESS_DRY=runif(20))

# Generate list of models

# Using for loop
models = list()

for(i in 0:9) {

  # Get the two column names to use for this iteration of the model
  vars = names(dat)[grepl(paste0(i,"$"), names(dat))]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dat)
}

# Same idea using purrr::map to iterate
models = map(0:9 %>% set_names(), 
             ~ {
               vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
               form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
               lm(form, data = dat)
             })
# Check first two models
models[1:2]
#> $`0`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T0      RDPI_T0  
#>      0.4543       0.3025      -0.1624  
#> 
#> 
#> $`1`
#> 
#> Call:
#> lm(formula = form, data = dat)
#> 
#> Coefficients:
#> (Intercept)       DRY_T1      RDPI_T1  
#>     0.64511     -0.33293      0.06698
# Get coefficients and p-values for each model in a single data frame
results = map_df(models, tidy, .id="run_number")

results
#> # A tibble: 30 x 6
#>    run_number term        estimate std.error statistic p.value
#>    <chr>      <chr>          <dbl>     <dbl>     <dbl>   <dbl>
#>  1 0          (Intercept)   0.454      0.153     2.96  0.00872
#>  2 0          DRY_T0        0.303      0.197     1.53  0.143  
#>  3 0          RDPI_T0      -0.162      0.186    -0.873 0.395  
#>  4 1          (Intercept)   0.645      0.185     3.49  0.00279
#>  5 1          DRY_T1       -0.333      0.204    -1.63  0.122  
#>  6 1          RDPI_T1       0.0670     0.236     0.284 0.780  
#>  7 2          (Intercept)   0.290      0.147     1.97  0.0650 
#>  8 2          DRY_T2        0.270      0.176     1.53  0.144  
#>  9 2          RDPI_T2       0.180      0.185     0.972 0.345  
#> 10 3          (Intercept)   0.273      0.187     1.46  0.162  
#> # … with 20 more rows

Created on 2019-06-28 by the reprex package (v0.2.1)

If you don't need to save the model objects, you can just return the data frame of coefficients and p-values:

results = map_df(0:9 %>% set_names(), 
            ~ {
              vars = names(dat)[grepl(paste0(.x,"$"), names(dat))]
              form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
              tidy(lm(form, data = dat))
            }, .id="run_number")

UPDATE: In answer to your comment, if you replace all instances of 0:9 with 1:10 (sorry, didn't notice that your column suffixes went from 1:10 rather than 0:9), and all instances of dat (my fake data) with dry2 (or whatever name you're using for your data frame), the code will run with your data, so long as the column names are the same as the ones you used in your question. If you're using different column names, you'll need to adapt the code, either by hard-coding the new names or by creating a function that can accept whatever column names you're using for the model(s) you're generating.

To explain what the code is doing: First, we need to get the names of the columns we want to use in each iteration of the model. For example, in the for-loop version:

vars = names(dry2)[grepl(paste0(i,"$"), names(dry2))]

When, for example, i=2, this resolves to:

vars = names(dry2)[grepl("2$", names(dry2))]
vars
[1] "RDPI_T2" "DRY_T2"

So those are the two columns we want to use to generate a regression formula. "2$" is a regular expression (regular expressions is a string matching language) that means: match values in names(dry2) that end with the number '2'.

To create our formula we do:

paste(vars, collapse=" + ")
[1] "RDPI_T2 + DRY_T2"
form = paste("FITNESS_DRY ~ ", paste(vars, collapse=" + "))
form
[1] "FITNESS_DRY ~  RDPI_T2 + DRY_T2"

And now we have our regression formula which we use inside lm.

Each iteration (either with for or map or, in @RomanLuštrik's suggestion, mapply), generates the successive models.

UPDATE 2: As I noted in the comment, I realized that the regular expression paste(i, "$") will fail (by matching more than one of each type of independent variable column) when the final number is more than one digit. So, try this instead (and similarly for the map version):

models = list()

for(i in 1:3772) {

  # Get the two column names to use for this iteration of the model
  vars = names(dry2)[names(dry2) %in% paste0(c("DRY_T", "RDPI_T"), i)]

  # Fit the model and add results to the output list
  models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")),
                                 data = dry2)
}

To see how things go wrong in the original version, run, for example, names(dry2)[grepl("2$", names(dry2))]

eipi10
  • 91,525
  • 24
  • 209
  • 285
  • Thanks eipi10. I will try your approach! – Elena Hamann Jun 28 '19 at 22:23
  • Your code works great on the fake data you generated, but I don't understand how I can adapt the script to run with my own dataset. Can you walk me through it a little more please? – Elena Hamann Jun 28 '19 at 22:53
  • 1
    See update to my answer and let me know if that helps. – eipi10 Jun 28 '19 at 23:45
  • 1
    It's working beautifully on my data subset :-) Thanks a lot for explaining the code a bit more. I should now be able to adapt it to my full data set. Again, I really appreciate your help! – Elena Hamann Jun 29 '19 at 00:13
  • Hey eipi10. Is there a way to include na.action=na.omit in the model specification? Your code works great on my data subset, it also works when I simulate a fake dataset like you did but with more columns, but it won't run on my original data. I get this error message: "Error in lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...) : 0 (non-NA) cases". I tried this but it doesn't help: models[[as.character(i)]] = lm(paste("FITNESS_DRY ~ ", paste(vars, collapse=" + ")), + data = dry5, na.action=na.omit) – Elena Hamann Jun 29 '19 at 17:28
  • It's probably not an NA issue afterall, since lm excludes NAs by default. I rather think something is wrong with the variable matching. I now have 3772 columns with DRY_T1...DRY_T3772, and 3772 columns with RDPI_T1...RDPI_T3772. I still have the FITNESS_DRY variable as my dependent variable. Why is this not computing just as with the subset? – Elena Hamann Jun 29 '19 at 19:10
  • Can you paste into your question (rather than into a comment) the minimal code needed to reproduce the error, along with the full error message? Also, I think the regular expression match needs to be updated to handle numbers with two or more digits, otherwise, you'll match, *all* columns names that end in, say, `2` when `2` on the second iteration. I'll update my answer in a moment to account for that. – eipi10 Jun 29 '19 at 19:16
  • I just added more information to my question. And yes it seems to be an issue with handling numbers with two or more digits! – Elena Hamann Jun 29 '19 at 23:02
  • YES!! this works perfectly, and I learned how to use the %in% operator :-) Is there any literature you can recommend to learn more about the type of coding you've helped me with? I did not know the tidyverse or broom package, and I'd love to become more proficient. – Elena Hamann Jun 30 '19 at 12:06
  • Look at the free online book [R for Data Science](https://r4ds.had.co.nz/) by the creator of the tidyverse. – eipi10 Jun 30 '19 at 16:34
0

Consider reshaping your very wide data frame to long format with reshape which is usually the preferred data format of practically any data science application.

For your needs, this requires two reshapes for each _T metric. After reshaping, create a T_NUM indicator (i.e., stripping the number of DRY_T## and RDPI_T##) and use that along with corresponding FITNESS_DRY to merge the two metrics.

Finally, use by to slice your large data frame by T_NUM groupings to build a list of models. Below uses the dat3 you simulated above. Altogether, all with base R: reshape -> TNUM <- ... -> merge -> by -> lm. The other methods, lapply, within, and Reduce are helpers for DRY-er code.

# TWO DATA FRAMES OF FOUR COLUMNS
df_list <- lapply(c("DRY_T", "RDPI_T"), function(i)
  within(reshape(dat3[c(grep(i, names(dat3)), ncol(dat3))],
                 varying = list(names(dat3)[grep(i, names(dat3))]),
                 v.names = i,
                 times = names(dat3)[grep(i, names(dat3))],
                 timevar = "T_NUM",
                 direction = "long"), {
           T_NUM <- as.integer(gsub(i, "", as.character(T_NUM)))
           id <- NULL
  })
)

# MERGE BOTH DFs
long_df <- Reduce(function(x, y) merge(x, y, by=c("T_NUM", "FITNESS_DRY")), df_list)

head(long_df, 10)
#    T_NUM FITNESS_DRY     DRY_T     RDPI_T
# 1      1   0.1528837 0.9438393 0.87948274
# 2      1   0.1925344 0.7023740 0.65120186
# 3      1   0.2193480 0.2388948 0.29875871
# 4      1   0.2743660 0.1291590 0.60097630
# 5      1   0.2877732 0.9763985 0.66921847
# 6      1   0.3082835 0.7605133 0.22456361
# 7      1   0.5196165 0.1848823 0.79543965
# 8      1   0.5603618 0.1680519 0.08759412
# 9      1   0.5789254 0.8535485 0.37942053
# 10     1   0.6291315 0.5526741 0.43043940

# NAMED LIST OF 3,772 MODELS
model_list <- by(long_df, long_df$T_NUM, function(sub) 
                  lm(FITNESS_DRY ~ DRY_T + RDPI_T, sub))

Output

summary(model_list$`1`)$coefficients
#               Estimate Std. Error    t value     Pr(>|t|)
# (Intercept)  0.7085512  0.1415849  5.0044269 0.0001085681
# DRY_T       -0.1423601  0.1985256 -0.7170867 0.4830577281
# RDPI_T      -0.1273237  0.2179249 -0.5842551 0.5667218157

summary(model_list$`2`)$coefficients
#              Estimate Std. Error   t value   Pr(>|t|)
# (Intercept) 0.3907525  0.1524423 2.5632809 0.02015115
# DRY_T       0.1952963  0.1990449 0.9811672 0.34026853
# RDPI_T      0.1979513  0.1884085 1.0506492 0.30812662

summary(model_list$`3`)$coefficients
#               Estimate Std. Error  t value   Pr(>|t|)
# (Intercept) 0.38836708  0.2076638 1.870172 0.07878049
# DRY_T       0.06995811  0.1965336 0.355960 0.72624947
# RDPI_T      0.27144752  0.2115787 1.282962 0.21672143

...
Parfait
  • 104,375
  • 17
  • 94
  • 125