What dummy variable to choose as reference in glm?

Question

I am analyzing the occupancy of bat boxes and the factors that influence it. To identify the most important variables, I am fitting a glm with occupancy as the response variable (0=occupied / 1=not occupied) and several explanatory variables. All of them are numerical except one categorical variable with 4 levels, which records what the bat box is mounted on (tree/pole/balcony/facade).

My code is:

modelb <- glm(occupation ~ TreeCov + distance_to_water + mounted_on, 
  family = binomial(link="cloglog"), data = mydata)

In the results I get:

coefficient         p value
TreeCov             0.0344         
distance_to_water   0.1291   
mounted_onTREE      0.7676   
mounted_onFACADE    0.4319     
mounted_onPOLE      0.0770

With

mydata$mounted_on <- relevel(mydata$mounted_on, ref="TREE")

the reference changes from balcony to tree, and when I rerun the model I get different p-values for my dummy variables:

coefficient         p value
TreeCov             0.0344         
distance_to_water   0.1291   
mounted_onBALCONY   0.45272   
mounted_onFACADE    0.0122     
mounted_onPOLE      0.02661

How do I choose which dummy variable should be my reference?

I'd recommend looking at the top solution [here](https://stackoverflow.com/questions/23347467/is-there-any-way-to-fit-a-glm-so-that-all-levels-are-included-i-e-no-refer). You can include all the factor levels by adding a `+0` to the formula, which drops the intercept and gives each level its own coefficient, tested against zero on the link scale. The p-values change because each one tests whether that factor level differs significantly from the base level, so changing the base level changes the test being carried out. - Jonny Phelps, Oct 04 '19 at 15:12
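To make the comment above concrete, here is a minimal sketch in R. It uses simulated data, not the bat box data: the column names mirror the question, but every value, the sample size, and the helper `fit()` are assumptions made up for illustration. It fits the same model with two reference levels, checks that the fit itself is unchanged, runs a likelihood-ratio test of `mounted_on` as a whole with `drop1()`, and fits the no-intercept (`0 +`) version.

```r
# Simulated stand-in for the bat box data; all values here are made up.
set.seed(1)
n <- 200
mydata <- data.frame(
  TreeCov           = runif(n, 0, 100),
  distance_to_water = runif(n, 0, 500),
  mounted_on        = factor(sample(c("BALCONY", "FACADE", "POLE", "TREE"),
                                    n, replace = TRUE))
)
mydata$occupation <- rbinom(n, 1, 0.4)

# Hypothetical helper: the model from the question, fitted to a given data frame.
fit <- function(d) {
  glm(occupation ~ TreeCov + distance_to_water + mounted_on,
      family = binomial(link = "cloglog"), data = d)
}

# Same model, two reference levels (BALCONY is the alphabetical default).
m_balcony   <- fit(mydata)
mydata_tree <- transform(mydata, mounted_on = relevel(mounted_on, ref = "TREE"))
m_tree      <- fit(mydata_tree)

# The fit is the same; only the contrasts reported in summary() differ.
same_deviance <- isTRUE(all.equal(deviance(m_balcony), deviance(m_tree),
                                  tolerance = 1e-6))
same_fitted   <- isTRUE(all.equal(fitted(m_balcony), fitted(m_tree),
                                  tolerance = 1e-6))

# Likelihood-ratio test of mounted_on as a whole; it does not depend on
# which level is the reference.
lrt_balcony <- drop1(m_balcony, test = "Chisq")
lrt_tree    <- drop1(m_tree, test = "Chisq")

# No-intercept parameterisation: one coefficient per level of mounted_on,
# each tested against zero on the link scale.
m_noint <- glm(occupation ~ 0 + mounted_on + TreeCov + distance_to_water,
               family = binomial(link = "cloglog"), data = mydata)
summary(m_noint)$coefficients
```

Comparing `summary(m_balcony)` with `summary(m_tree)` shows the dummy-variable rows changing while the rows for the numerical covariates stay the same, as in the question. The `drop1()` row for `mounted_on` answers the separate question of whether the mounting type matters at all, and that answer is the same whichever level is the reference.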
