2

I'm running a regression in the form

reg=lm(y ~ x1+x2+x3+z1,data=mydata)

In the place of the last term, z1, I want to loop through a set of different variables, z1 through z10, running a regression for each with it as the last term. E.g. in second run I want to use

reg=lm(y ~ x1+x2+x3+z2,data=mydata)

in 3rd run:

reg=lm(y ~ x1+x2+x3+z3,data=mydata)

How can I automate this by looping through the list of z-variables?

Sam Firke
  • 21,571
  • 9
  • 87
  • 105
Beta
  • 1,638
  • 5
  • 33
  • 67
  • 1
    possible duplicate of [Linear Regression loop for each independent variable individually against dependent](http://stackoverflow.com/questions/25036007/linear-regression-loop-for-each-independent-variable-individually-against-depend) – Sam Firke May 15 '15 at 16:48
  • 1
    @SamFirke The OP will probably also need assistance in learning how to construct each formula (since their formula is a little more involved) using `paste` or something, and that information isn't at the question you linked to. – joran May 15 '15 at 16:52
  • 1
    Yes, this is doable in R. One approach is to use an apply function like `lapply` which will take your vector of 10 variables and return a list of 10 `lm` objects with just that variable changing. See also http://stackoverflow.com/questions/13418148/efficient-looping-logistic-regression-in-r and http://stackoverflow.com/questions/15304514/for-loops-for-regression-over-multiple-variables-outputting-a-subset?rq=1. An apply function is more "R-ish" than a for-loop. – Sam Firke May 15 '15 at 16:53
  • 2
    One more: http://stackoverflow.com/a/17026386/210673 – Aaron left Stack Overflow May 15 '15 at 17:03
  • Thanks everybody for the quick response! Thanks again! – Beta May 15 '15 at 17:16
  • Also, you may like the `broom` package if you then want to get the coefficients out of the list of models and into a new data.frame. – Sam Firke May 15 '15 at 18:07

4 Answers4

6

While what Sam has provided works and is a good solution, I would personally prefer to go about it slightly differently. His answer has already been accepted, so I'm just posting this for the sake of completeness.

dat1 <- data.frame(y = rpois(100, 5),
                   x1 = runif(100),
                   x2 = runif(100),
                   x3 = runif(100),
                   z1 = runif(100),
                   z2 = runif(100))

lapply(colnames(dat1)[5:6],
       function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = " + ")), data = d),
       d = dat1)

Rather than looping over the actual columns of the data frame, this loops only over the string of names. This provides some speed improvements as fewer things are copied between iterations.

library(microbenchmark)

microbenchmark({ lapply(<what I wrote above>) })
# Unit: milliseconds
# expr
# {lapply(colnames(dat1)[5:6], function(x, d) lm(as.formula(paste("y ~ x1 + x2 + x3", x, sep = "+")), data = d), d = dat1)}
#       min       lq     mean   median       uq      max neval
#  4.014237 4.148117 4.323387 4.220189 4.281995 5.898811   100

microbenchmark({ lapply(<other answer>) })
# Unit: milliseconds
# expr
# {lapply(dat1[, 5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))}
#       min       lq     mean   median       uq    max neval
#  4.391494 4.505056 5.186972 4.598301 4.698818 51.573   100

The difference is fairly small for this toy example, but as the number of observations and predictors increases, the difference will likely become more pronounced.

Alex A.
  • 5,466
  • 4
  • 26
  • 56
  • +1, I agree that looping over the names is preferable. You could shorten this to `lapply(names(dat1)[5:6], function(x) lm(as.formula(paste0("y ~ x1 + x2 + x3 + ", x)), data = dat1))`. What is the benefit of `data = d), d = dat1)`? – Sam Firke May 15 '15 at 18:04
  • @SamFirke: As I understand it, the benefit of passing `dat1` as a parameter is to avoid using globals within the function. – Alex A. May 15 '15 at 18:07
  • Interesting answer! But will your solution apply to [this other question](https://stackoverflow.com/questions/58756403/combinations-of-variables-that-produce-the-smallest-quantities-in-an-r-function) as well? – rnorouzian Nov 08 '19 at 01:35
3

Depending on what your final goal is, it can be much faster to fit a base model, update it with add1, and extract the F-test/AIC you want:

> basemodel <- lm(y~x1+x2+x3, dat1)
> 
> add1(object=basemodel, grep("z\\d", names(dat1), value=TRUE), test="F")
Single term additions

Model:
y ~ x1 + x2 + x3
       Df Sum of Sq    RSS    AIC F value Pr(>F)
<none>              477.34 164.31               
z1      1    0.0768 477.26 166.29  0.0153 0.9019
z2      1    5.1937 472.15 165.21  1.0450 0.3093

See also ?update for refitting the model.

Neal Fultz
  • 9,282
  • 1
  • 39
  • 60
2

With this dummy data:

dat1 <- data.frame(y = rpois(100,5),
x1 = runif(100),
x2 = runif(100),
x3 = runif(100),
z1 = runif(100),
z2 = runif(100)
)

You could get your list of two lm objects this way:

 lapply(dat1[5:6], function(x) lm(dat1$y ~ dat1$x1 + dat1$x2 + dat1$x3 + x))

Which iterates through those two columns and substitutes them as arguments into the lm call.

As Alex notes below, it's preferable to pass the names through the formula, rather than the actual data columns as I have done here.

Sam Firke
  • 21,571
  • 9
  • 87
  • 105
  • 5
    Not a fan of the use of the global `dat1` inside of `lapply` like that. I would probably use `as.formula(paste("y ~ ...", x, sep = "+"))` with the `data` argument to `lm` populated with another parameter in `function`. – Alex A. May 15 '15 at 16:58
  • @AlexA.: Thanks Alex for your comment. I'll try to incorporate it while udating the code. – Beta May 15 '15 at 17:15
1

Here's a different approach using packages from the dplyr / tidyr family. It restructures the data to a long form, then uses group_by() from the dplyr package instead of lapply():

library(dplyr)
library(tidyr)
library(magrittr) # for use_series ()
dat1 %>%
  gather(varname, z, z1:z2) %>% # convert data to long form
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  use_series(model)

This converts the data to a long format using gather, where the z-values occupy the same column. use_series() from the magrittr package return the list of lm objects instead of a data.frame. If you load the broom package, you can extract the model coefficients in this pipeline of code:

library(broom)
dat1 %>%
  gather(varname, z, z1:z2) %>%
  group_by(varname) %>%
  do(model = lm(y ~ x1 + x2 + x3 + z, data = .)) %>%
  glance(model) # or tidy(model)

Source: local data frame [2 x 12]
Groups: varname

  varname  r.squared adj.r.squared    sigma statistic   p.value df    logLik      AIC      BIC deviance df.residual
1      z1 0.06606736    0.02674388 2.075924  1.680099 0.1609905  5 -212.3698 436.7396 452.3707 409.3987          95
2      z2 0.06518852    0.02582804 2.076900  1.656192 0.1666479  5 -212.4168 436.8337 452.4647 409.7840          95

Data:

dat1 <- data.frame(y = rpois(100, 5), x1 = runif(100),
                   x2 = runif(100), x3 = runif(100),
                   z1 = runif(100), z2 = runif(100))
Sam Firke
  • 21,571
  • 9
  • 87
  • 105