
While studying logistic regression using caret's mdrr data, some questions arose. I created a full model using a total of 19 variables, and I have a question about how the categorical variables are labelled.

In my regression model, the categorical variables are:

  • nDB : 0 or 1 or 2

  • nR05 : 0 or 1

  • nR10 : 1 or 2

I fitted the full model using glm, but I do not understand why the names of the categorical variables have one of the category values appended to them.

-------------------------------------------------------------------------------

glm(formula = mdrrClass ~ ., family = binomial, data = train)

# Coefficients:
# (Intercept)         nDB1         nDB2           nX        nR051        nR101        nBnz2  
#   5.792e+00    5.287e-01   -3.103e-01   -2.532e-01   -9.291e-02    9.259e-01   -2.108e+00  
#         SPI          BLI          PW4         PJI2          Lop         BIC2         VRA1  
#   3.222e-05   -1.201e+01   -3.754e+01   -5.467e-01    1.010e+00   -5.712e+00   -2.424e-04  
#         PCR          H3D          FDI         PJI3        DISPm        DISPe      G.N..N.  
#  -6.397e-02   -4.360e-04    3.458e+01   -6.579e+00   -5.690e-02    2.056e-01   -7.610e-03  
#
# Degrees of Freedom: 263 Total (i.e. Null);  243 Residual
# Null Deviance:     359.3
# Residual Deviance: 232.6   AIC: 274.6

-------------------------------------------------------------------------------

The above results show that nDB appears with numbers appended, and nR05 and nR10 likewise have a category value attached to their names. I am wondering why these numbers are attached.

Simon
monsoon

3 Answers


This is always the case for categorical variables, especially when they are not binary (like your nDB): the suffix tells you which level the coefficient belongs to. For the nDB variable the model has created two new dummy variables: nDB1, which equals 1 if nDB = 1 and 0 otherwise (i.e. if nDB = 0 or nDB = 2), and nDB2, which equals 1 if nDB = 2 and 0 otherwise.
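A quick way to see this coding is model.matrix(), which shows exactly the dummy columns R builds behind the scenes before fitting. The data below is simulated for illustration (it is not the mdrr data):

```r
# Simulated 3-level factor standing in for nDB
nDB <- factor(c(0, 1, 2, 1, 0))

# model.matrix() expands the factor the same way glm() does internally:
# one intercept column plus one dummy column per non-reference level
model.matrix(~ nDB)
# Columns: (Intercept), nDB1 (1 when nDB == 1), nDB2 (1 when nDB == 2);
# rows where nDB == 0 (the reference level) have 0 in both dummy columns
```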

swaps1

To analyze a binary response variable (with values such as TRUE/FALSE, 0/1, or YES/NO) as a function of a quantitative explanatory variable, logistic regression can be used.

Consider for example the following data, where x is the age of 40 people, and y indicates whether they bought a death metal album in the last 5 years (1 if "yes", 0 if "no"). Graphically, we can see that the older people are, the less likely they are to buy death metal.

[figure: scatter plot of purchase indicator y (0/1) against age x]

Logistic regression is a special case of the Generalized Linear Model (GLM). With a classical linear regression model, we consider the following model:

Y = αX + β

The expectation of Y is therefore predicted as follows:

E(Y) = αX + β

Here, because of the binary distribution of Y, the above relation cannot apply directly. To "generalize" the linear model, we therefore consider that

g(E(Y)) = αX + β

where g is a link function. In this case, for a logistic regression, the link function corresponds to the logit function:

logit(p) = log(p / (1 - p))

Note that this logit function transforms a value p between 0 and 1 (such as a probability) into a value between -∞ and +∞. Here's how to do the logistic regression in R:

myreg=glm(y~x, family=binomial(link=logit))
summary(myreg)

## Call:
## glm(formula = y ~ x, family = binomial(link = logit))
## 
## Deviance Residuals: 
##     Min       1Q   Median       3Q      Max  
## -1.8686  -0.7764   0.3801   0.8814   2.0253  
## 
## Coefficients:
##             Estimate Std. Error z value Pr(>|z|)   
## (Intercept)   5.9462     1.9599   3.034  0.00241 **
## x            -0.1156     0.0397  -2.912  0.00360 **
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
## 
## (Dispersion parameter for binomial family taken to be 1)
## 
##     Null deviance: 52.925  on 39  degrees of freedom
## Residual deviance: 39.617  on 38  degrees of freedom
## AIC: 43.617
## 
## Number of Fisher Scoring iterations: 5

We obtain the following model:

logit(E(Y)) = -0.12X + 5.95

and we note that the (negative) influence of age on the purchase of death metal albums is significant at the 5% level (Pr(>|z|) ≈ 0.0036 < 0.05).

Thus, logistic regression is often used to bring out risk factors (such as age, but also BMI, sex, and so on).
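To turn the fitted logit back into a probability, the inverse-logit function (plogis() in R) can be applied. A small sketch, using the coefficients from the summary output above (the exact result depends on the fitted values):

```r
# Coefficients taken from the summary output above
intercept <- 5.9462
slope <- -0.1156

# Predicted probability of buying a death metal album at age 40,
# via the inverse-logit 1 / (1 + exp(-x)), i.e. plogis() in R
p <- plogis(intercept + slope * 40)
p  # ~0.79
```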

  • This answer doesn't address the question at all. OP is asking about dummy coding categorical predictors – Simon Sep 27 '18 at 22:35

When you have categorical predictors in any regression model, you need to create dummy variables. R does this for you, and the output you see shows the contrasts.

Your variable nDB has 3 levels: 0, 1, 2

One of those needs to be chosen as the reference level (R chose 0 for you in this case, but this can also be specified manually). Then dummy variables are created to compare every other level against the reference level: 0 vs 1 and 0 vs 2.

R names these dummy variables nDB1 and nDB2. nDB1 is for the 0 vs 1 contrast, and nDB2 is for the 0 vs 2 contrast. The numbers after the variable names just indicate which contrast you're looking at.

The coefficient values are interpreted as the difference in your outcome (on the log-odds scale here, since this is a logistic model) between groups 0 and 1 (nDB1), and separately between groups 0 and 2 (nDB2). In other words, what change in the outcome would you expect when moving from one group to the other?

Your other categorical variables have 2 levels and are just a simpler case of the above.

For example, nR05 only has 0 and 1 as values. 0 was chosen as the reference, and because there's only one possible contrast here, a single dummy variable is created comparing 0 vs 1. In the output that dummy variable is called nR051.
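On the "specified manually" point: one common way to change the reference level is relevel(). A minimal sketch with simulated data (not the mdrr data):

```r
# Simulated 3-level factor standing in for nDB
nDB <- factor(c(0, 1, 2, 2, 1, 0))
levels(nDB)  # "0" "1" "2" -- the first level, "0", is the default reference

# Make "2" the reference level instead; a model fitted with this factor
# would then report the contrasts nDB0 (2 vs 0) and nDB1 (2 vs 1)
nDB_releveled <- relevel(nDB, ref = "2")
levels(nDB_releveled)  # "2" "0" "1"
```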

Simon
  • I read your answer carefully, but I have additional follow-up questions. First, since nDB has three levels, doesn't it also compare 1 against 2? Secondly, you said the comparison can be adjusted manually, but how? – monsoon Sep 30 '18 at 10:43
  • 1) Dummy coding is done with comparisons against the reference, so it won't do 1 vs 2, unfortunately. 2) You can relevel the factor to choose your reference. See this question: https://stackoverflow.com/questions/3872070/how-to-force-r-to-use-a-specified-factor-level-as-reference-in-a-regression – Simon Sep 30 '18 at 12:48