
I have a question about coding interaction effects with dummy variables, and I'd be really grateful for your advice.

Imagine I want to design an experiment to measure the effect of the amount of food eaten, in grams (a continuous variable), on happiness scores (also continuous) in three species: zebras, lions and giraffes. My variables would be i) happiness, ii) food and iii) species. As I understand it, I could set up a regression model in three different ways:

1. Using dummy coding (i.e. 1 or 0 for zebra and lion), with giraffe as the reference category:

   Happiness ~ food + food x zebra + food x lion

2. Including interaction terms for all species:

   Happiness ~ food + food x zebra + food x lion + food x giraffe

3. Including interaction terms for all species, without a main effect for food:

   Happiness ~ food x zebra + food x lion + food x giraffe

The 2nd model makes the most sense to me, as it seems to isolate the trans-species effect of food eaten in the "food" variable and then capture the interaction effect for each species. However, most guides I've read seem to recommend the 1st approach without explaining why. Please could someone explain whether one model is preferable?

NB: My concern with the first approach is that the "food" variable neither reflects a trans-species effect (because it is skewed towards the effect for giraffes, as they don't have their own interaction term) nor is equivalent to the food*giraffe term (as it includes some trans-species effect). Have I misunderstood something?
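
For reference, the three specifications can be written out explicitly. This is only a sketch under two assumptions that are not stated above: that "x" means multiplication by a 0/1 species indicator, and that all species share a single intercept $\beta_0$. The symbols ($H$ for happiness, $F$ for food, $Z$, $L$, $G$ for the zebra, lion and giraffe indicators, and the coefficient names) are illustrative.

$$
\begin{aligned}
\text{(1)}\quad H &= \beta_0 + \beta_1 F + \beta_2 FZ + \beta_3 FL + \varepsilon \\
\text{(2)}\quad H &= \beta_0 + \beta_1 F + \beta_2 FZ + \beta_3 FL + \beta_4 FG + \varepsilon \\
\text{(3)}\quad H &= \beta_0 + \gamma_Z FZ + \gamma_L FL + \gamma_G FG + \varepsilon
\end{aligned}
$$

Because every animal belongs to exactly one species, $Z + L + G = 1$ and therefore $F = FZ + FL + FG$. Under these assumptions, the food slope in (1) is $\beta_1$ for giraffes, $\beta_1 + \beta_2$ for zebras and $\beta_1 + \beta_3$ for lions, which matches (3) with $\gamma_G = \beta_1$, $\gamma_Z = \beta_1 + \beta_2$ and $\gamma_L = \beta_1 + \beta_3$. In (2), adding any constant $c$ to $\beta_1$ while subtracting $c$ from $\beta_2$, $\beta_3$ and $\beta_4$ leaves every prediction unchanged, so those four coefficients are not uniquely determined by the data.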

Shawn Hemelstrand
  • There's bound to be a good answer to this somewhere on CrossValidated, e.g. this one has some context: https://stats.stackexchange.com/questions/172943/interpreting-dummy-variables-in-glm. I'd recommend posting your question on that site as well, as the community is geared more towards theoretical statistical answers. – Jonny Phelps Dec 22 '21 at 12:19
  • I don't think option #2 is possible with your data, though. `food x zebra + food x lion + food x giraffe` will account for all the `food` effects, meaning there is no data left to determine what the standard `food` effect would be, e.g. for a tiger. This is because you're fitting a fixed effects model, which assumes that `food` is independently distributed for each species. If this were a mixed or random effects model, you could assume that each species has the same underlying food distribution and arrive at a global food effect. – Jonny Phelps Dec 22 '21 at 12:22
  • Options #1 and #3 will give you the same result, just formatted differently. In #1 the effect for giraffe gets absorbed into the intercept + food, whilst in #3 it goes into food:giraffe. Predictively, they give the same thing; you just interpret the model differently. – Jonny Phelps Dec 22 '21 at 12:23
  • Also, check this: https://stackoverflow.com/questions/23347467/is-there-any-way-to-fit-a-glm-so-that-all-levels-are-included-i-e-no-refer – Jonny Phelps Dec 22 '21 at 12:25
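
The two claims in the comments (that the model with `food` plus all three interactions cannot be uniquely estimated, and that the first and third models give the same predictions) can be checked numerically. The following is a minimal sketch using NumPy on simulated data; the sample size, slopes, intercept and noise level are all made up for illustration, and a single shared intercept is assumed.

```python
import numpy as np

rng = np.random.default_rng(0)  # fixed seed so the sketch is repeatable

# Hypothetical simulated data: 30 animals per species, food in grams.
n_per = 30
species = np.repeat(["zebra", "lion", "giraffe"], n_per)
food = rng.uniform(100, 1000, size=species.size)
true_slope = {"zebra": 0.02, "lion": 0.05, "giraffe": 0.01}  # made-up values
slopes = np.array([true_slope[s] for s in species])
happiness = 5.0 + slopes * food + rng.normal(0, 1, size=species.size)

# 0/1 dummy variables for each species.
zebra = (species == "zebra").astype(float)
lion = (species == "lion").astype(float)
giraffe = (species == "giraffe").astype(float)
ones = np.ones_like(food)

# Design matrices for the three specifications (intercept column first).
X1 = np.column_stack([ones, food, food * zebra, food * lion])
X2 = np.column_stack([ones, food, food * zebra, food * lion, food * giraffe])
X3 = np.column_stack([ones, food * zebra, food * lion, food * giraffe])

# Model 2 has 5 columns but food = food*zebra + food*lion + food*giraffe,
# so its rank is lower than its column count.
rank1, rank2, rank3 = (np.linalg.matrix_rank(X) for X in (X1, X2, X3))
print("columns:", X1.shape[1], X2.shape[1], X3.shape[1])
print("ranks:  ", rank1, rank2, rank3)

# Ordinary least squares fits for models 1 and 3.
b1, *_ = np.linalg.lstsq(X1, happiness, rcond=None)
b3, *_ = np.linalg.lstsq(X3, happiness, rcond=None)
fitted1 = X1 @ b1
fitted3 = X3 @ b3

print("same fitted values:", np.allclose(fitted1, fitted3))
print("giraffe slope, model 1 vs 3:", b1[1], b3[3])
print("zebra slope,   model 1 vs 3:", b1[1] + b1[2], b3[1])
print("lion slope,    model 1 vs 3:", b1[1] + b1[3], b3[2])
```

In this setup the `food` coefficient of the first model is the giraffe slope, and the zebra and lion interaction coefficients are differences from it, so the choice between the first and third models is about interpretation rather than fit.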
