How do I include a categorical variable in a binary logistic regression in R? I want to test the influence of the professional field (student, worker, teacher, self-employed) on the probability of purchasing a product.

In my example, y is a binary variable (1 for buying the product, 0 for not buying).
- x1: gender (0 = male, 1 = female)
- x2: age (between 20 and 80)
- x3: the categorical variable (1 = student, 2 = worker, 3 = teacher, 4 = self-employed)

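set.seed(123)
y  <- round(runif(100, 0, 1))    # purchase: 1 = bought, 0 = not bought
x1 <- round(runif(100, 0, 1))    # gender: 0 = male, 1 = female
x2 <- round(runif(100, 20, 80))  # age
x3 <- round(runif(100, 1, 4))    # professional field, coded 1-4 (still numeric here)
test <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
summary(test)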

If I include x3 (the professional field) in the regression above as it is, I get the wrong estimates/interpretation for x3.

What do I have to do to get the right estimates for the categorical variable (x3)?

Thanks a lot

Jordan
  • I believe that since all of y, x1, x2 and x3 are random, the correct relationship is that these are unrelated and all slopes are zero. – G5W Jan 05 '18 at 21:14
  • You say "I get the wrong estimates/interpretation". What are the estimates you expect? I don't see what the problem is here. – MrFlick Jan 05 '18 at 21:16
  • OK, but the regression above is only an example. My problem is: if I set x3 to, for example, `as.factor(x3)`, then I get wrong coefficients for x3. – Jordan Jan 05 '18 at 21:19
  • The coefficients say that if I increase x1 (or x2) by 1 unit, the probability of a purchase rises/falls. But x3 only contains professional fields, so an increase of one unit says nothing about an increase/decrease in the probability of y. – Jordan Jan 05 '18 at 21:23
  • That is not at all how you simulated your data. Right now there is no relationship at all between `y` and any of the `x` values. And currently you are modeling `x3` as continuous. If you want to use dummy variables, use `factor(x3)` in your formula. So I'm not sure if your question is really about simulating data or about modeling. The "right" estimates in the example above are all 0, and in the sample none of them are statistically significantly different from 0. – MrFlick Jan 05 '18 at 21:31
  • Yes, you are right. My question is how I can model x3, but I don't know how. When I use `factor(x3)`, I think the result is wrong. – Jordan Jan 05 '18 at 21:34
  • Generally people turn these into dummy variables using `?model.matrix` – Ian Wesley Jan 05 '18 at 21:36
  • Then I have to form 4 dummy variables? For every professional field one dummy? – Jordan Jan 05 '18 at 21:38
  • Yes that is how folks normally deal with categorical variables in models. – Ian Wesley Jan 05 '18 at 21:52
  • Perhaps this [link](https://stats.idre.ucla.edu/r/dae/logit-regression/) would be useful to you. It has been to me. You can also read up on the interpretation of odds ratios. – Cedric Jan 05 '18 at 21:55
  • @IanWesley That's not true. If you use a formula with a factor, R will make dummy variables for you. Rarely is it necessary to use `model.matrix` directly. – MrFlick Jan 05 '18 at 22:04
  • @MrFlick Many models require a `model.matrix` and will not work with data frames or with categorical variables. It is highly model dependent, but I appreciate your point that that is how `glm` works. I assumed the questioner wanted to know how to encode categorical variables for use in logistic regression, for which the most common way is to use dummy variables. – Ian Wesley Jan 05 '18 at 22:07
  • @MrFlick I did not mean to take away from your previous point with my comment. Factor should work. I was just trying to illustrate how to encode the data for the model. – Ian Wesley Jan 05 '18 at 22:12
  • Thanks for your comments! Then I'll try it either with `as.factor(x3)` or by creating 4 dummy variables. – Jordan Jan 05 '18 at 22:20
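
To make the dummy-variable discussion above concrete, here is a minimal sketch of what R does with a factor in a formula. The six example codes in `x3_codes` are made up for illustration; only the four labels come from the question. With an intercept in the model, a four-level factor is expanded into three dummy columns (not four), because the first level serves as the reference.

```r
# Hypothetical illustration: a few professions coded 1..4, as in the question
x3_codes <- c(1, 2, 3, 4, 2, 1)
x3 <- factor(x3_codes, levels = 1:4,
             labels = c("student", "worker", "teacher", "self-employed"))

# Design matrix that R builds internally for a formula containing x3
X <- model.matrix(~ x3)
print(colnames(X))
print(X)

# Default treatment contrasts: the first level ("student") is the reference
print(contrasts(x3))
```

A row with all three dummy columns equal to 0 is a student, so `glm` needs no hand-made dummies when `x3` is already a factor.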

1 Answer

I suggest you set x3 as a factor variable; there is no need to create dummies:

set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))

test <- glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)

This is the output of your model:

Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4665  -1.1054  -0.9639   1.1979   1.4044  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.464751   0.806463   0.576    0.564
x1               0.298692   0.413875   0.722    0.470
x2              -0.002454   0.011875  -0.207    0.836
x3worker        -0.807325   0.626663  -1.288    0.198
x3teacher       -0.567798   0.615866  -0.922    0.357
x3self-employed -0.715193   0.756699  -0.945    0.345

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 135.98  on 94  degrees of freedom
AIC: 147.98

Number of Fisher Scoring iterations: 4
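
Note that "student" does not appear in the coefficient table: it is the reference level, and each `x3...` row compares that profession with students on the log-odds scale. For example, the estimate of -0.807 for `x3worker` corresponds to an odds ratio of about 0.45 relative to students. The sketch below is one possible way to read the result, not the only one: it refits the same simulated data, converts the coefficients to odds ratios, and tests x3 as a whole with a likelihood-ratio test. The object names `full`, `reduced`, `odds_ratios` and `lrt` are my own.

```r
set.seed(123)
y  <- round(runif(100, 0, 1))
x1 <- round(runif(100, 0, 1))
x2 <- round(runif(100, 20, 80))
x3 <- factor(round(runif(100, 1, 4)),
             labels = c("student", "worker", "teacher", "self-employed"))

full    <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
reduced <- glm(y ~ x1 + x2,      family = binomial(link = "logit"))

# Odds ratios: each x3 entry compares that profession with "student"
odds_ratios <- exp(coef(full))
print(round(odds_ratios, 3))

# Likelihood-ratio test of x3 as a whole (4 levels -> 3 degrees of freedom)
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

Since the data are purely random, no effect of x3 should be expected here; the point is only the mechanics.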

In any case, I suggest you study this post on R-bloggers: https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/

Scipione Sarlo
  • Thanks for the help. My problem before (with `as.factor`) was interpreting the coefficients with regard to the reference level. But now I have understood it and my problem is solved. – Jordan Jan 07 '18 at 12:11
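
For readers with the same reference-level question, here is a small sketch of how the baseline category can be changed with `relevel`. It reuses the simulated data from the answer; choosing "worker" as the new reference and the names `fit`, `refit` and `x3_worker_ref` are arbitrary choices for illustration.

```r
set.seed(123)
y  <- round(runif(100, 0, 1))
x1 <- round(runif(100, 0, 1))
x2 <- round(runif(100, 20, 80))
x3 <- factor(round(runif(100, 1, 4)),
             labels = c("student", "worker", "teacher", "self-employed"))

# Original fit: "student" (the first level) is the reference
fit <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))

# Same model with "worker" as the reference level instead
x3_worker_ref <- relevel(x3, ref = "worker")
refit <- glm(y ~ x1 + x2 + x3_worker_ref, family = binomial(link = "logit"))

print(levels(x3_worker_ref))
print(coef(refit))
```

The model fit itself is unchanged; only the comparisons differ. The student-versus-worker coefficient in `refit` is the worker-versus-student coefficient in `fit` with the sign flipped.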