different possible combinations of variables for a generalized linear model

Question

Is there a method in R to run a GLM for each of the different variables in a data frame with a different combinations e.g.

If I have 4 explanatory variables I can model Y as

m1 = glm(Y ~ V1, data = d)
m2 = glm(Y ~ V1 + V2, data = d)
m3 = glm(Y ~ V1 + V2 + V3, data = d)
m4 = glm(Y ~ V1 + V2 + V3 + V4, data = d)

However, I can also have

m5 = glm(Y ~ V1 + V2 + V4, data = d)

and so on.

Is there a method in R to select all of the different possible combinations of variables within a data frame to see which variables act as the best predictors?

score 13 · Accepted Answer · answered Mar 26 '14 at 14:47

This is called dredging:

library(MuMIn)
data(Cement)
fm1 <- lm(y ~ ., data = Cement)
dd <- dredge(fm1)

Global model call: lm(formula = y ~ ., data = Cement)
---
Model selection table 
   (Intrc)    X1      X2      X3      X4 df  logLik  AICc delta weight
4    52.58 1.468  0.6623                  4 -28.156  69.3  0.00  0.566
12   71.65 1.452  0.4161         -0.2365  5 -26.933  72.4  3.13  0.119
8    48.19 1.696  0.6569  0.2500          5 -26.952  72.5  3.16  0.116
10  103.10 1.440                 -0.6140  4 -29.817  72.6  3.32  0.107
14  111.70 1.052         -0.4100 -0.6428  5 -27.310  73.2  3.88  0.081
15  203.60       -0.9234 -1.4480 -1.5570  5 -29.734  78.0  8.73  0.007
16   62.41 1.551  0.5102  0.1019 -0.1441  6 -26.918  79.8 10.52  0.003
13  131.30               -1.2000 -0.7246  4 -35.372  83.7 14.43  0.000
7    72.07        0.7313 -1.0080          4 -40.965  94.9 25.62  0.000
9   117.60                       -0.7382  3 -45.872 100.4 31.10  0.000
3    57.42        0.7891                  3 -46.035 100.7 31.42  0.000
11   94.16        0.3109         -0.4569  4 -45.761 104.5 35.21  0.000
2    81.48 1.869                          3 -48.206 105.1 35.77  0.000
6    72.35 2.312          0.4945          4 -48.005 109.0 39.70  0.000
5   110.20               -1.2560          3 -50.980 110.6 41.31  0.000
1    95.42                                2 -53.168 111.5 42.22  0.000

So, here the first model shown is includes the 'best' possible combination of predictors. Is there a way then of stating which one of those predictors is the most 'influential' in that model i.e. does X1 X2 X3 of X4 explain most of the variation in the data? or is this a different question all together? — KatyB, Mar 26 '14 at 15:22
Yes, that is a different different question and not easy to answer since the parameters are not independent. — Roland, Mar 26 '14 at 15:27
OK, so those values under X1...etc i.e. 1.468... dont represent the contribution of that variable? — KatyB, Mar 26 '14 at 16:30

score 4 · Answer 2 · answered Mar 26 '14 at 14:48

If you only wish to use base R and no packages that allow you to do dredging, you can use the combn function and make a list of all possible GLMs objects:

d <- data.frame(replicate(5, rnorm(10)))
names(d) <- c('Y', paste0('V', 1:4))
dep_var <- 'Y'
indep_vars <- setdiff(names(d), dep_var)

glms <- Reduce(append, lapply(seq_along(indep_vars),
  function(num_vars) {
    Reduce(append, apply(combn(length(indep_vars), num_vars), 2, function(vars) {
      formula_string <- paste(c(dep_var, paste(indep_vars[vars], collapse = "+")), collapse = '~')
      structure(list(glm(as.formula(formula_string), data = d)), .Names = formula_string)
    }))
  }
))

print(names(glms))
# [1] "Y~V1"          "Y~V2"          "Y~V3"          "Y~V4"          "Y~V1+V2"       "Y~V1+V3"       "Y~V1+V4"       "Y~V2+V3"       "Y~V2+V4"       "Y~V3+V4"       "Y~V1+V2+V3"    "Y~V1+V2+V4"
# [13] "Y~V1+V3+V4"    "Y~V2+V3+V4"    "Y~V1+V2+V3+V4"

print(glms[["Y~V2+V3+V4"]])

# Call:  glm(formula = as.formula(formula_string), data = d)
#
# Coefficients:
# (Intercept)           V2           V3           V4
#     0.12721      0.04748      0.11369     -0.04258

# Degrees of Freedom: 9 Total (i.e. Null);  6 Residual
# Null Deviance:      8.932
# Residual Deviance: 8.695  AIC: 36.98

It may be simplier to use `expand.grid` to generate all combinations: `all.var.comb <- expand.grid(lapply(seq_along(indep_vars), c, 0))` — Kamil Bartoń, Mar 27 '14 at 13:06
@Robert Krzyzanowski Great answer! do you know how to get the results from Cross validation of the resulted models? Now I want to choose the "best" model but doing it one by one will be very slow. Thank you. — programandoconro, Oct 30 '17 at 04:09

different possible combinations of variables for a generalized linear model

2 Answers2

Linked

Related