Run all possible interactions in GLM regression using R

Question

I have a large dataset of medical insurance claims to which I want to apply GLM regression. I have 4 categorical predictor variables: Gender, Age group, Nationality, and Room Type (VIP, normal, etc.).

My basic GLM model will include the intercept term and these 4 variables. I now want to introduce two-way interactions but I am not certain about which interactions are significant for the model and which are not. For this purpose, I want to run all possible combinations of the interactions along with the 4 base predictors and then compare all the model results based on a certain characteristic such as AIC or BIC or R-square.

I want to know if there is a function or an easy way in R to run all the possible interactions and save their AIC/BIC/R-square without having to write down the glm function for each possible model.

A few examples of the models to run would be:

 1. intercept + Gender + Age + Nationality + RoomType
 2. intercept + Gender + Age + Nationality + RoomType + Gender*Age
 3. intercept + Gender + Age + Nationality + RoomType + Gender*Nationality
 4. intercept + Gender + Age + Nationality + RoomType + Gender*RoomType
 5. intercept + Gender + Age + Nationality + RoomType + Age*Nationality
 6. intercept + Gender + Age + Nationality + RoomType + Age*RoomType
 7. intercept + Gender + Age + Nationality + RoomType + Nationality*RoomType
 8. intercept + Gender + Age + Nationality + RoomType + Gender*Age + Gender*Nationality

and so on.

If you want all interactions, try the formula `Resp ~ Gender * Age * Nationality * RoomType`. — Rui Barradas, May 11 '18 at 14:38
If you want all interactions *sequentially*, I believe there is no easy function to do this for you. I suggest building it manually with something like `combn(x,2)` then `combn(x,3)` ..., used with `as.formula` or such. This explodes a little, though: think about `G+A+N+R+G*A` and `G+A+N+R+G*A+G*N` and `G+A+N+R+G*A+G*R`, etc. You want all 2-ways *and permutations of 2-ways*, then 3-ways and all permutations of 3-ways and some 2-ways ... do you see how this becomes a mess? I suggest this (and some pedagogic rationale) might be why such a function does not (yet) exist. — r2evans, May 11 '18 at 14:51
It is very easy in this age of accessible computer power and machine learning to assume that if you only construct a fancy algorithm and let a computer churn on it for a few billion cycles, any problem can be solved. Sadly, the world isn't that simple. Algorithmic model selection is a very tempting idea, but it is fraught with pits. [Gung and others](https://stats.stackexchange.com/questions/20836) have explained this in more detail over on CV, but in short: 'regression to the mean', 'overfitting' and 'multiple comparison problem', are concepts you should keep in mind. — AkselA, May 11 '18 at 16:59

score 2 · Answer 1 · answered May 11 '18 at 15:11

Let's first generate some combinations of variable names.

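vars <- c("Gender", "Age", "Nationality", "RoomType")
comb.vars <- expand.grid(vars, vars, stringsAsFactors = FALSE)
# Keep each unordered pair once: this drops self-pairs and mirrored
# duplicates such as Age*Gender alongside Gender*Age
comb.vars <- comb.vars[comb.vars[, 1] < comb.vars[, 2], ]

i.vars <- apply(comb.vars, 1, paste, collapse = "*")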
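As a cross-check on the interaction terms, here is a minimal sketch that builds each unordered pair exactly once with `combn()`. It uses `:` (interaction only) rather than `*`; because the main effects are already in the formula, both specify the same model. The name `pair.terms` is only illustrative.

```r
vars <- c("Gender", "Age", "Nationality", "RoomType")

# combn() enumerates unordered pairs, so Gender:Age appears once
# and Age:Gender does not appear at all.
pair.terms <- combn(vars, 2, FUN = paste, collapse = ":")

length(pair.terms)  # choose(4, 2) = 6
pair.terms
```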

Then, let's combine the interactions into batches of exhaustive combinations (inspiration here).

combs.vars <- list(i.vars)
k <- length(i.vars) - 1
# Go down to k = 1 so that models with a single interaction are included
while (k >= 1) {
  combs <- t(combn(i.vars, k))
  combs.vars <- c(combs.vars, split(combs, seq(nrow(combs))))
  k <- k - 1
}
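To get a feel for how quickly this grows: with p main effects there are choose(p, 2) distinct two-way interactions, and every subset of them (including the empty one, i.e. the base model) defines a candidate, giving 2^choose(p, 2) models. A small sketch; the helper names `n.models` and `all.subsets` and the placeholder terms are made up for illustration.

```r
# Number of candidate models when all main effects are always kept
# and any subset of the two-way interactions may be added.
n.models <- function(p) 2^choose(p, 2)

n.models(4)  # 64
n.models(5)  # 1024

# Enumerate every non-empty subset of a set of terms, sizes 1..n.
all.subsets <- function(terms) {
  unlist(lapply(seq_along(terms), function(k) {
    combn(terms, k, simplify = FALSE)
  }), recursive = FALSE)
}

terms <- c("A:B", "A:C", "B:C")  # placeholder interaction terms
subsets <- all.subsets(terms)
length(subsets)  # 2^3 - 1 = 7
```

So four predictors are still manageable, but each extra predictor multiplies the number of fits considerably.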

Last, let's create formulas out of the combinations and run GLM on them.

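res <- data.frame(formula = character(length(combs.vars)),
                  AIC = NA_real_, stringsAsFactors = FALSE)

for (i in seq_along(combs.vars)) {
  # Main effects plus the i-th subset of interactions, as formula text
  f.txt <- paste("response ~ Gender + Age + Nationality + RoomType +",
                 paste(combs.vars[[i]], collapse = " + "))
  # glm() defaults to family = gaussian; set family to suit your response
  fit <- glm(as.formula(f.txt), data = input.data)
  # Store the formula text itself: fit$call would only record the symbol f
  res$formula[i] <- f.txt
  res$AIC[i] <- AIC(fit)
}

res <- res[order(res$AIC), ]  # lowest AIC first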

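Note that `response` and `input.data` must be replaced with the actual names of your response variable and of the data.frame holding your data. Also, `glm()` defaults to `family = gaussian`, so pass a `family` argument if your response calls for a different distribution.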
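For a self-contained trial run, here is a sketch on simulated data that fits the base model plus every subset of the six two-way interactions and records both AIC and BIC. Everything about the data is made up: the factor levels, the sample size, and the gamma-distributed `response`. The `Gamma(link = "log")` family is only an assumption about what might suit claim amounts, and `fit.one` is a hypothetical helper name.

```r
set.seed(1)
n <- 400

# Simulated stand-in for the claims data (all values are made up).
input.data <- data.frame(
  Gender      = factor(sample(c("F", "M"), n, replace = TRUE)),
  Age         = factor(sample(c("18-35", "36-55", "56+"), n, replace = TRUE)),
  Nationality = factor(sample(c("A", "B", "C"), n, replace = TRUE)),
  RoomType    = factor(sample(c("Normal", "VIP"), n, replace = TRUE))
)
input.data$response <- rgamma(n, shape = 2, rate = 0.01)

vars  <- c("Gender", "Age", "Nationality", "RoomType")
terms <- combn(vars, 2, FUN = paste, collapse = ":")

# All subsets of the six interactions; the empty subset is the base model.
subsets <- c(list(character(0)),
             unlist(lapply(seq_along(terms), function(k)
               combn(terms, k, simplify = FALSE)), recursive = FALSE))

# Fit one model: main effects plus the given interaction terms.
fit.one <- function(extra) {
  f.txt <- paste("response ~", paste(c(vars, extra), collapse = " + "))
  # Assumed family; replace with whatever suits the real response.
  fit <- glm(as.formula(f.txt), family = Gamma(link = "log"),
             data = input.data)
  data.frame(formula = f.txt, AIC = AIC(fit), BIC = BIC(fit),
             stringsAsFactors = FALSE)
}

res <- do.call(rbind, lapply(subsets, fit.one))
res <- res[order(res$AIC), ]  # lowest AIC first
head(res)
```

Because the simulated response does not depend on any predictor, the ranking here is pure noise, which is a useful reminder of the caveats about automated model selection raised in the comments on the question.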

Looks good, but this only includes two-way interactions, right? — AkselA, May 11 '18 at 15:16
Yes. All of them, plus all combinations of all two-way interactions. The OP asked for two-way. To expand to the three-way, include an analogical extra section with `expand.grid(vars, vars, vars)`. But, as @r2evans says, is such an approach reasonable? — nya, May 11 '18 at 15:17
It wasn't clear from the question whether only two-ways were required, but even what you've supplied here is above and beyond reasonable, I think. My combinatorics isn't very strong, so I don't know how many possible interaction combinations there are with 5 variables, but I suspect it's rather a lot. — AkselA, May 11 '18 at 16:18
