2

I have a dataset with 1-3 versions of the dependent variable, and 10-15 independent variables. I'd like to run a glm command for the model, but would like it to loop for ALL possible combinations of independent variables. I've never written code for a loop, and want to make sure I set it up correctly.

Below is a small subset of my data frame. The actual dataframe has an explicit name for each variable; not just "DepVar1" or "IndVar1."

dfPRAC <- structure(list(DepVar1 = c(0, 0, 0, 0, 1, 0, 0, 1, 1, 1, 1, 1, 
1, 1, 1, 0, 0, 0, 0, 0, 1, 1, 0, 1), DepVar2 = c(0, 1, 0, 0, 
1, 1, 0, 1, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1), 
    IndVar1 = c(0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 1, 
    0, 0, 0, 1, 0, 0, 0, 1, 0), IndVar2 = c(1, 3, 9, 1, 5, 1, 
    1, 8, 4, 6, 3, 15, 4, 1, 1, 3, 2, 1, 10, 1, 9, 9, 11, 5), 
    IndVar3 = c(0.500100322564443, 1.64241601558441, 0.622735778490702, 
    2.42429812749226, 5.10055213237027, 1.38479786027561, 7.24663629203007, 
    0.5102348706939, 2.91566510995229, 3.73356170379198, 5.42003495939846, 
    1.29312896116503, 3.33753833987496, 0.91783513806083, 4.7735736131668, 
    1.17609362602233, 5.58010703426296, 5.6668754863739, 1.4377813063642, 
    5.07724130837643, 2.4791994535923, 2.55100067348583, 2.41043629522981, 
    2.14411703944206)), .Names = c("DepVar1", "DepVar2", "IndVar1", 
"IndVar2", "IndVar3"), row.names = c(NA, 24L), class = "data.frame")

My current code for running a single glm model is:

RegPRAC <- glm(DepVar1 ~ IndVar1, data=dfPRAC, family=binomial("logit"))
summary(RegPRAC)

I'd like to run the model for ALL possible combinations of independent variables, with all combinations of dependent variables, but I'm not sure where to start. I was thinking something like:

for (i in dfPRAC$IndVar1:dfPRAC$IndVar3) {glm(DepVar1 ~ i, data=dfPRAC, family=binomial("logit")) }

I tried running it, but got several errors. Any suggestions would be appreciated.

David Arenburg
  • 91,361
  • 17
  • 137
  • 196
KKL234
  • 367
  • 1
  • 5
  • 23
  • What errors do you get? – alexforrence Feb 19 '15 at 18:03
  • Error in model.frame.default(formula = DepVar1 ~ i, data = dfPRAC, drop.unused.levels = TRUE) : variable lengths differ (found for 'i') In addition: Warning messages: 1: In dfPRAC$IndVar1:dfPRAC$IndVar3 : numerical expression has 24 elements: only the first used 2: In dfPRAC$IndVar1:dfPRAC$IndVar3 : numerical expression has 24 elements: only the first used – KKL234 Feb 19 '15 at 18:05

1 Answers1

6

Maybe something like that:

dep_vars <- c("DepVar1", "DepVar2") 
ind_vars <- c("IndVar1", "IndVar2", "IndVar3")

# create all combinations of ind_vars
ind_vars_comb <- 
  unlist( sapply( seq_len(length(ind_vars)), 
          function(i) {
               apply( combn(ind_vars,i), 2, function(x) paste(x, collapse = "+"))
          }))

# pair with dep_vars:
var_comb <- expand.grid(dep_vars, ind_vars_comb ) 

# formulas for all combinations
formula_vec <- sprintf("%s ~ %s", var_comb$Var1, var_comb$Var2)

# create models
glm_res <- lapply( formula_vec, function(f)   {
    fit1 <- glm( f, data = dfPRAC, family = binomial("logit"))
    fit1$coefficients <- coef( summary(fit1))
    return(fit1)
})
names(glm_res) <- formula_vec

# get model for specific formula
glm_res[["DepVar1 ~ IndVar1"]] 

# coefficients for var1 ~ var1
coef(glm_res[["DepVar1 ~ IndVar1"]])

# p-values for var1 ~ var2
coef(glm_res[["DepVar1 ~ IndVar2"]])[,"Pr(>|z|)"]

# p-values in a data.frame
p_values <- 
  cbind(formula_vec, as.data.frame ( do.call(rbind,
        lapply(glm_res, function(x) {
          coefs <- coef(x)
          rbind(c(coefs[,4] , rep(NA, length(ind_vars) - length(coefs[,4]) + 1)))
        })
  )))

Result:

                         formula_vec (Intercept)    IndVar1         V3        V4
1                  DepVar1 ~ IndVar1  1.00000000 1.00000000         NA        NA
2                  DepVar2 ~ IndVar1  0.65526203 0.29437334         NA        NA
3                  DepVar1 ~ IndVar2  0.29307777 0.19121066         NA        NA
4                  DepVar2 ~ IndVar2  0.07298241 0.03858791         NA        NA
5                  DepVar1 ~ IndVar3  0.99950535 0.99940963         NA        NA
6                  DepVar2 ~ IndVar3  0.52105212 0.44715614         NA        NA
7          DepVar1 ~ IndVar1+IndVar2  0.31112860 0.76310468 0.18416266        NA
8          DepVar2 ~ IndVar1+IndVar2  0.06488501 0.08833369 0.03031766        NA
9          DepVar1 ~ IndVar1+IndVar3  0.99952006 0.99999188 0.99940957        NA
10         DepVar2 ~ IndVar1+IndVar3  0.38508258 0.29593637 0.45010697        NA
11         DepVar1 ~ IndVar2+IndVar3  0.28167430 0.15753070 0.54363164        NA
12         DepVar2 ~ IndVar2+IndVar3  0.22644873 0.04654188 0.84059019        NA
13 DepVar1 ~ IndVar1+IndVar2+IndVar3  0.27858393 0.71600105 0.14812808 0.5222330
14 DepVar2 ~ IndVar1+IndVar2+IndVar3  0.15634739 0.08611677 0.02889574 0.7449513
bergant
  • 7,122
  • 1
  • 20
  • 24
  • This worked well! Thank you for the suggestion. Two quick questions though. 1) How can I tell which Dependent Variable the model is running on? Only the Independent Variable is noted. And 2) Is there a way to also get p-values from this? As well as accurate prediction rate? I've used "ClassLog" in the past to get % prediction, but not sure how to apply it? – KKL234 Feb 20 '15 at 16:01
  • See [extract-pvalue-from-glm](http://stackoverflow.com/questions/23838937/extract-pvalue-from-glm) – bergant Feb 20 '15 at 16:45
  • Adding names worked great! However, the extract p-value question refers to a specific model. If I have 1,000 models and I want to check the p-values for all of them, can I do that with a single command rather than running it for each model? – KKL234 Feb 20 '15 at 18:56
  • thanks for adding the p-value line. Unfortunately, it just returns the p-value for an individual model. Is it possible to use just a few lines of code to get ALL p-values for the variable combinations, so I can easily compare them? – KKL234 Feb 23 '15 at 13:29
  • Have to loop over `glm_res` structure. See update for a result in a data frame. – bergant Feb 23 '15 at 14:26
  • Sorry, I should have re-emphasized this earlier, but as I was adding more and more independent variables, I realized that the code was just running 1 dependent variable with 1 independent variable. Is there a way to run the code for ALL possible combinations of independent variables? DepVar1 ~ all combinations of 2 Independent Variables, then Dep1 ~ all combinations of 3 Independent Variables etc. I tried changing the code to read: formula_vec <- sprintf("%s ~ %s+%s", var_comb$Var1, var_comb$Var2, var_comb$Var3) but it didn't work; the resulting model was just a list. – KKL234 Feb 23 '15 at 20:19
  • 1
    Updated. See 2 changes: 1) ind_vars is updated with all combinations of values, 2) p_values are extended – bergant Feb 23 '15 at 21:02
  • If you have a chance, would you mind expanding on what the "4" and "+1" represent in the extended p_value code. I'm trying to apply the code to a larger list of independent variables, and would like to understand where 4 and 1 came from so I can adjust them for changing numbers of independent variables. I've tried applying it to a list of 10 independent variables, and keep coming up with an "invalid time argument" error. Thanks for all of the help! – KKL234 Feb 23 '15 at 23:15
  • The intercept and p-values are allways in the 4th column of the coefficients table. The number of rows is 1 + number of independent variables. See for example `coef(glm_res[["DepVar1 ~ IndVar1+IndVar2"]])`. The code should work for more than 3 variables. – bergant Feb 23 '15 at 23:24
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/71537/discussion-between-kkl234-and-bergant). – KKL234 Feb 23 '15 at 23:28