Categorical variable in logistic regression in R

Question

How do I include a categorical variable in a binary logistic regression in R? I want to test the influence of professional field (student, worker, teacher, self-employed) on the probability of purchasing a product.

In my example, y is a binary variable (1 for buying a product, 0 for not buying).
- x1: gender (0 = male, 1 = female)
- x2: age (between 20 and 80)
- x3: the categorical variable (1 = student, 2 = worker, 3 = teacher, 4 = self-employed)

set.seed(123)
y<-round(runif(100,0,1))
x1<-round(runif(100,0,1))
x2<-round(runif(100,20,80))
x3<-round(runif(100,1,4))
test<-glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)

If I include x3 (the professional field) in the regression above, I get the wrong estimates/interpretation for x3.

What do I have to do to get the right influence/estimates for the categorical variable (x3)?

Thanks a lot

I believe that since all of y, x1, x2 and x3 are random, the correct relationship is that these are unrelated and all slopes are zero. — G5W, Jan 05 '18 at 21:14
You say "I get the wrong estimates/interpretation". What are the estimates you expect? I don't see what the problem is here. — MrFlick, Jan 05 '18 at 21:16
OK, but the regression above is only an example. My problem is: if I set x3 to, for example, `as.factor(x3)`, then I get the wrong coefficients for x3. — Jordan, Jan 05 '18 at 21:19
The coefficients say that if I increase x1 (or x2) by 1 unit, then the probability of a purchase rises/falls. But x3 contains only professional fields, so an increase of one unit says nothing about an increase/decrease in the probability of y. — Jordan, Jan 05 '18 at 21:23
That is not at all how you simulated your data. Right now there is no relationship at all between `y` and any of the `x` values. And currently you are modeling `x3` as continuous. If you want to use dummy variables, use `factor(x3)` in your formula. So I'm not sure if your question is really about simulating data or about modeling. The "right" estimates in the example above are all 0, and in the sample none of them are statistically significantly different from 0. — MrFlick, Jan 05 '18 at 21:31
Yes, you are right. My question is: how can I model x3? I don't know how. When I use `factor(x3)` I think it is wrong. — Jordan, Jan 05 '18 at 21:34
Generally people turn these into dummy variables using `?model.matrix` — Ian Wesley, Jan 05 '18 at 21:36
Then I have to form 4 dummy variables? For every professional field one dummy? — Jordan, Jan 05 '18 at 21:38
Yes that is how folks normally deal with categorical variables in models. — Ian Wesley, Jan 05 '18 at 21:52
Perhaps this [link](https://stats.idre.ucla.edu/r/dae/logit-regression/) would be useful to you. It has been to me. You can also read up on the interpretation of odds ratios. — Cedric, Jan 05 '18 at 21:55
@IanWesley That's not true. If you use a formula with a factor, R will make dummy variables for you. Rarely is it necessary to use `model.matrix` directly. — MrFlick, Jan 05 '18 at 22:04
@MrFlick Many models require a `model.matrix` and will not work with data frames or with categorical variables. It is highly model dependent, but I appreciate your point that that is how `glm` works. I assumed the questioner wanted to know how to encode categorical variables for use in logistic regression, for which the most common way is to use dummy variables. — Ian Wesley, Jan 05 '18 at 22:07
@MrFlick I did not mean to take away from your previous point with my comment. Factor should work. I was just trying to illustrate how to encode the data for the model. — Ian Wesley, Jan 05 '18 at 22:12
Thanks for your comments! Then I will try it either with `as.factor(x3)` or by creating 4 dummy variables. — Jordan, Jan 05 '18 at 22:20
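
On the "4 dummy variables" point raised in the comments: with an intercept in the model, a 4-level factor needs only 3 dummy columns, because the first level acts as the reference and is absorbed into the intercept. A minimal sketch of what R builds internally, using a small made-up vector (the six values below are illustrative, not the question's data):

```r
# Illustrative data only; the level names follow the coding in the question.
x3 <- factor(c(1, 2, 3, 4, 2, 1),
             levels = 1:4,
             labels = c("student", "worker", "teacher", "self-employed"))

# model.matrix() shows the design matrix that glm() constructs from a formula.
# Default treatment contrasts: one intercept column plus one 0/1 column for
# every level except the first ("student"), which is the reference level.
X <- model.matrix(~ x3)
print(colnames(X))
print(X)
```

Rows belonging to the reference level have zeros in all three dummy columns, which is why supplying four hand-made dummies alongside an intercept would be redundant.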

score 4 · Accepted Answer · answered Jan 06 '18 at 09:44

I suggest making x3 a factor variable; there is no need to create dummies:

set.seed(123)
y <- round(runif(100,0,1))
x1 <- round(runif(100,0,1))
x2 <- round(runif(100,20,80))
x3 <- factor(round(runif(100,1,4)),labels=c("student", "worker", "teacher", "self-employed"))

test <- glm(y~x1+x2+x3, family=binomial(link="logit"))
summary(test)

Here is the output of the model:

Call:
glm(formula = y ~ x1 + x2 + x3, family = binomial(link = "logit"))

Deviance Residuals: 
    Min       1Q   Median       3Q      Max  
-1.4665  -1.1054  -0.9639   1.1979   1.4044  

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)      0.464751   0.806463   0.576    0.564
x1               0.298692   0.413875   0.722    0.470
x2              -0.002454   0.011875  -0.207    0.836
x3worker        -0.807325   0.626663  -1.288    0.198
x3teacher       -0.567798   0.615866  -0.922    0.357
x3self-employed -0.715193   0.756699  -0.945    0.345

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 138.47  on 99  degrees of freedom
Residual deviance: 135.98  on 94  degrees of freedom
AIC: 147.98

Number of Fisher Scoring iterations: 4
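
Each `x3` row in the table is a log-odds difference relative to the reference level `student`, holding x1 and x2 fixed. For example, exp(-0.807) is roughly 0.45, so the estimated odds of a purchase for a worker are a bit less than half those of a student (not significant here, as expected for random data). The following is a sketch, not part of the original answer, of how one might obtain the odds ratios and test `x3` as a whole; the names `full`, `reduced`, `odds_ratios` and `lrt` are chosen for illustration:

```r
# Same simulated data as in the answer above.
set.seed(123)
y  <- round(runif(100, 0, 1))
x1 <- round(runif(100, 0, 1))
x2 <- round(runif(100, 20, 80))
x3 <- factor(round(runif(100, 1, 4)),
             labels = c("student", "worker", "teacher", "self-employed"))

full    <- glm(y ~ x1 + x2 + x3, family = binomial(link = "logit"))
reduced <- glm(y ~ x1 + x2,      family = binomial(link = "logit"))

# Odds ratios: exponentiate the log-odds coefficients.
# Each x3 entry compares that level with the reference level "student".
odds_ratios <- exp(coef(full))
print(round(odds_ratios, 3))

# Likelihood ratio test of x3 as a whole: does adding the three dummy
# coefficients improve the fit over the model without x3?
lrt <- anova(reduced, full, test = "Chisq")
print(lrt)
```

The individual z-tests in `summary()` only compare each level with the reference; the likelihood ratio test answers the broader question of whether professional field matters at all.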

In any case, I suggest studying this post on R-bloggers: https://www.r-bloggers.com/logistic-regression-and-categorical-covariates/

Thanks for the help. My problem before (with `as.factor`) was interpreting the coefficients with regard to the reference level. But now I have understood it and my problem is solved. — Jordan, Jan 07 '18 at 12:11
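
Regarding the reference level mentioned in the comment above: by default R uses the first factor level (`student` here) as the baseline. If a different baseline is easier to interpret, `relevel()` can change it. A minimal sketch, assuming one wants `teacher` as the reference (the names `x3_t`, `fit_default` and `fit_teacher` are made up for illustration):

```r
# Same simulated data as in the answer.
set.seed(123)
y  <- round(runif(100, 0, 1))
x1 <- round(runif(100, 0, 1))
x2 <- round(runif(100, 20, 80))
x3 <- factor(round(runif(100, 1, 4)),
             labels = c("student", "worker", "teacher", "self-employed"))

# Move "teacher" to the front so it becomes the reference level.
x3_t <- relevel(x3, ref = "teacher")
print(levels(x3_t))

fit_default <- glm(y ~ x1 + x2 + x3,   family = binomial(link = "logit"))
fit_teacher <- glm(y ~ x1 + x2 + x3_t, family = binomial(link = "logit"))

# The coefficients now compare each field with "teacher" ...
print(coef(fit_teacher))

# ... but the fitted model itself is the same, only parameterised differently.
print(all.equal(deviance(fit_default), deviance(fit_teacher)))
```

Changing the reference level changes which comparisons the coefficients express, not the fit or the predicted probabilities.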
