How can I fix the coefficients for individual levels of a categorical variable in R?

I have read "How to set the Coefficient Value in Regression; R", and it does a good job of explaining how to fix a coefficient for a categorical variable as a whole. I would like to know how to fix one for each level.

As an example, let us look at the mtcars dataset:

df <- as.data.frame(mtcars)

df$cyl <- as.factor(df$cyl)
set.seed(1)
glm(mpg ~ cyl + hp + gear, data = df)

This gives me the following output:

Call:  glm(formula = mpg ~ cyl + hp + gear, data = df)

Coefficients:
(Intercept)         cyl6         cyl8           hp         gear  
   19.80268     -4.07000     -2.29798     -0.05541      2.79645  

Degrees of Freedom: 31 Total (i.e. Null);  27 Residual
Null Deviance:      1126 
Residual Deviance: 219.5    AIC: 164.4

If I wanted to fix cyl6 at -0.34 and cyl8 at -1.4, and then refit to see how that affects the other coefficients, how would I do that?

nghauran
Jordan

1 Answer

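I think this is what you can do: shift the response by the fixed value for each level of cyl, then fit the remaining terms without cyl. Mind the sign: subtracting 0.34 and 1.4, as below, corresponds to fixing the coefficients at +0.34 and +1.4; to fix them at -0.34 and -1.4, add those amounts to mpg instead.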

df$mpgCyl=df$mpg
df$mpgCyl[df$cyl==6]=df$mpgCyl[df$cyl==6]-0.34
df$mpgCyl[df$cyl==8]=df$mpgCyl[df$cyl==8]-1.4
model2=glm(mpgCyl ~ hp + gear, data = df)
> model2

Call:  glm(formula = mpgCyl ~ hp + gear, data = df)

Coefficients:
(Intercept)           hp         gear  
   16.86483     -0.07146      3.53128  
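
The same idea can also be written with an offset() term, which keeps mpg itself as the response. This is only a minimal sketch and not part of the code above; the names fixed, cyl_fixed, fit_offset and fit_shift are made up for illustration. An offset enters the linear predictor with its coefficient fixed at 1, so the vector holds the desired contribution for each level: 0 for the reference level cyl == 4, and -0.34 and -1.4 as asked in the question. Moving that vector to the left-hand side means computing mpg - cyl_fixed, i.e. adding 0.34 and 1.4 to mpg for those levels.

```r
df <- mtcars
df$cyl <- factor(df$cyl)

# Fixed per-level contributions (values taken from the question);
# cyl == 4 is the reference level, so its contribution is 0.
fixed <- c("4" = 0, "6" = -0.34, "8" = -1.4)
df$cyl_fixed <- unname(fixed[as.character(df$cyl)])

# Variant 1: keep mpg as the response and pass the fixed part as an offset.
fit_offset <- glm(mpg ~ hp + gear + offset(cyl_fixed), data = df)

# Variant 2: move the fixed part to the left-hand side by hand.
fit_shift <- glm(I(mpg - cyl_fixed) ~ hp + gear, data = df)

# Both variants estimate the same intercept, hp and gear coefficients.
all.equal(unname(coef(fit_offset)), unname(coef(fit_shift)))
```

With the offset variant, predict() and fitted() stay on the original mpg scale, which may be more convenient than working with a shifted response.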

UPDATE in response to the comments:

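cyl is a factor, so by default each of its levels contributes to the glm as an additive shift of the intercept, not as a slope. That is why the fixed value is added or subtracted rather than multiplied. cyl==4 is the reference level: it does not appear among the coefficients, but it is still in the model, absorbed into the intercept. So what your first glm says is: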

1) for cyl==4: mpg = 19.80 - 0.055*hp + 2.80*gear
2) for cyl==6: mpg = (19.80 - 4.07) - 0.055*hp + 2.80*gear
3) for cyl==8: mpg = (19.80 - 2.30) - 0.055*hp + 2.80*gear
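
A quick way to check these three equations is to rebuild the per-level intercepts from the coefficients and compare them with predict() at hp = 0 and gear = 0. This is just a sketch; the names fit, b, intercepts and nd are chosen here for illustration.

```r
df <- mtcars
df$cyl <- factor(df$cyl)
fit <- glm(mpg ~ cyl + hp + gear, data = df)
b <- coef(fit)

# Intercept for each level: the reference level (cyl == 4) uses the plain
# intercept, the other levels add their own shift to it.
intercepts <- c(
  cyl4 = b[["(Intercept)"]],
  cyl6 = b[["(Intercept)"]] + b[["cyl6"]],
  cyl8 = b[["(Intercept)"]] + b[["cyl8"]]
)

# With hp = 0 and gear = 0 the prediction is exactly the per-level intercept.
nd <- data.frame(
  cyl  = factor(c("4", "6", "8"), levels = levels(df$cyl)),
  hp   = 0,
  gear = 0
)
all.equal(unname(intercepts), unname(predict(fit, newdata = nd)))
```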

You can also check https://stats.stackexchange.com/questions/213710/level-of-factor-taken-as-intercept and the question "Is there any way to fit a `glm()` so that all levels are included (i.e. no reference level)?"
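
As a small illustration of the "no reference level" idea, dropping the intercept with `0 +` in the formula makes R report one coefficient per level of cyl. This is a sketch of the standard formula syntax rather than something taken from the linked posts; fit_ref and fit_all are illustrative names. The model is the same, only the parameterisation changes.

```r
df <- mtcars
df$cyl <- factor(df$cyl)

# Default parameterisation: cyl == 4 is absorbed into the intercept.
fit_ref <- glm(mpg ~ cyl + hp + gear, data = df)

# No intercept: every level of cyl gets its own coefficient.
fit_all <- glm(mpg ~ 0 + cyl + hp + gear, data = df)
names(coef(fit_all))

# Each level coefficient equals the intercept plus that level's shift
# in the default parameterisation.
b <- coef(fit_ref)
all.equal(
  unname(coef(fit_all)[c("cyl4", "cyl6", "cyl8")]),
  unname(b[["(Intercept)"]] + c(0, b[["cyl6"]], b[["cyl8"]]))
)
```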

Hope this helps

Antonios
  • Thank you. Would you explain why it is `df$mpgCyl[df$cyl==6]=df$mpgCyl[df$cyl==6]-0.34` and not `df$mpgCyl[df$cyl==6]=df$mpgCyl[df$cyl==6]*0.34`? – Jordan Feb 16 '18 at 13:34
  • I replied in the updated answer, as it requires some writing. – Antonios Feb 16 '18 at 13:46
  • Thank you. I do understand the output of a factor as an offset compared to a slope (for continuous data). Thank you for the update. Now I see how your answer mathematically makes sense. – Jordan Feb 16 '18 at 13:50