
My data consist of a dummy dependent variable (1 = disclosed, 0 = not disclosed) and a categorical independent variable (five types of sectors).

With these data, can a linear regression model be used?

My objective is to identify which sectors do or do not disclose.

So is this a good approach? For example:

summary(lm(Disclosed ~ 0 + Sectors, data = df_0))

I add `0 +` to the model to eliminate the intercept, so that the first sector is also returned as a coefficient. If I don't add it, I don't understand why the first sector is not returned. I am very lost. Thanks!

If I use a binomial logistic regression, I don't know how to interpret the significance values and the signs of the estimates it returns.

Call:
glm(formula = Disclosed ~ 0 + Sectors, family = binomial(link = "logit"), 
    data = df_0)

Deviance Residuals: 
     Min        1Q    Median        3Q       Max  
-0.96954  -0.32029  -0.00005  -0.00005   2.48638  

Coefficients:
                         Estimate Std. Error z value Pr(>|z|)    
SectorsCOMMUNICATION      -0.5108     0.5164  -0.989  0.32256    
SectorsCONSIMERSTAPLES   -20.5661  6268.6324  -0.003  0.99738    
SectorsCONSUMERDISCRET    -3.0445     1.0235  -2.975  0.00293 ** 
SectorsENERGY            -20.5661  3780.1276  -0.005  0.99566    
SectorsFINANCIALS         -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE        -20.5661  5345.9077  -0.004  0.99693    
SectorsINDUSTRIALS       -20.5661  2803.4176  -0.007  0.99415    
SectorsINDUSTRIALS       -20.5661 17730.3699  -0.001  0.99907    
SectorsINFORMATION        -1.0986     0.8165  -1.346  0.17846    
SectorsMATERIALS         -20.5661  3780.1276  -0.005  0.99566    
SectorsREALESTATE        -20.5661  8865.1850  -0.002  0.99815    
SectorsUTILITIES         -20.5661  7238.3932  -0.003  0.99773    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 277.259  on 200  degrees of freedom
Residual deviance:  54.185  on 188  degrees of freedom
AIC: 78.185

Number of Fisher Scoring iterations: 19

This means that the financials and consumer discretionary sectors are the least likely to disclose, right?

On the other hand, if I apply `lm`, it returns results that seem more consistent. The sectors that disclose the most are information and communication, with significant, positive estimates:

Call:
lm(formula = Disclosed ~ 0 + Sectors, data = df_0)

Residuals:
    Min      1Q  Median      3Q     Max 
-0.3750 -0.0500  0.0000  0.0000  0.9546 

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)    
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000    
SectorsCONSUMERDISCRET 4.545e-02  4.427e-02   1.027 0.305815    
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000    
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426    
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000    
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000    
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000    
SectorsINFORMATION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000    
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000    
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared:  0.2632,    Adjusted R-squared:  0.2162 
F-statistic: 5.597 on 12 and 188 DF,  p-value: 3.568e-08
David Perea
  • consider running probabilistic models eg `logistic regression` or `probit regression` – Onyambu May 07 '21 at 01:27
  • You might want to use a logistic regression, you can do that by adding `family = "binomial"` as an argument for `lm`. – Jonathan V. Solórzano May 07 '21 at 02:35
  • If I indicate that it is a binomial, it returns the following error: `In lm.fit(x, y, offset = offset, singular.ok = singular.ok, ...): extra argument 'family' will be ignored.` – David Perea May 07 '21 at 09:30

1 Answer


It would be better to use logistic regression for this particular problem.

Regarding the linear regression output: for a categorical independent variable, `lm` takes the first class/category in alphabetical order as the base class, shows it as the intercept, and reports the other classes relative to it.

In the example below, category A becomes the intercept and the coefficients for the other classes are relative to class A.

For example,

set.seed(100)

a <- sample(c(1,0), 100, replace = TRUE)
b <- sample(c('A', 'B', 'C', 'D', 'E'), 100, replace = TRUE)

lm(a ~ b)
Call:
lm(formula = a ~ b)

Coefficients:
(Intercept)           bB           bC           bD           bE  
   0.562500    -0.183190     0.104167    -0.107955    -0.006944  

is the same as

Call:
lm(formula = a ~ 0 + b)

Coefficients:
    bA      bB      bC      bD      bE  
0.5625  0.3793  0.6667  0.4545  0.5556  
c <- broom::tidy(lm(a ~ 0 + b))
c$estimate
[1] 0.5625000 0.3793103 0.6666667 0.4545455 0.5555556

d <- broom::tidy(lm(a ~ b))
d$estimate
[1]  0.562500000 -0.183189655  0.104166667 -0.107954545 -0.006944444

d$estimate[2:5] + d$estimate[1]
[1] 0.3793103 0.6666667 0.4545455 0.5555556
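The same toy data can be fitted with `glm` to show how the logistic coefficients relate to these group means (a sketch; `plogis()` is base R's inverse-logit function):

```r
set.seed(100)

a <- sample(c(1, 0), 100, replace = TRUE)
b <- sample(c('A', 'B', 'C', 'D', 'E'), 100, replace = TRUE)

# Intercept-free logistic regression: one log-odds estimate per category
fit <- glm(a ~ 0 + b, family = binomial)

# The inverse logit of each coefficient is that category's proportion of 1s,
# i.e. the same values lm(a ~ 0 + b) estimates directly
plogis(coef(fit))
```

So `lm` reports each category's mean on the probability scale, while `glm` reports the same quantity on the log-odds scale.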
Roach
  • You need to use `glm` instead of `lm`, `glm` takes `family` as argument. – Roach May 07 '21 at 13:33
  • Wouldn't it be correct to use `lm` without a `family`? Using `glm`, it return me no coherent meanings. My model is better using `lm`. There would be a problem? @roach – David Perea May 07 '21 at 23:15
  • Basically, in logistic regression we get output on the `logit` scale; to classify, we do something like `ifelse(estimate > 0.5, 'Disclosed', 'Not Disclosed')`. The inverse-logit maps to [0,1], whereas linear regression is unbounded. https://stackoverflow.com/questions/12146914/what-is-the-difference-between-linear-regression-and-logistic-regression explains this in more detail. – Roach May 07 '21 at 23:22
  • I understand. Thanks!. Could you help me solve my result that it returns to me? I have added it in the main message. Thanks! @roach – David Perea May 07 '21 at 23:43
  • `logit2prob <- function(logit){ odds <- exp(logit) prob <- odds / (1 + odds) return(prob) }` https://sebastiansauer.github.io/convert_logit2prob/. Use this function to convert logit output (coefficients) to probabilities. – Roach May 08 '21 at 02:19
  • And regarding linear regression, `OLS regression. When used with a binary response variable, this model is known as a linear probability model and can be used as a way to describe conditional probabilities. However, the errors (i.e., residuals) from the linear probability model violate the homoskedasticity and normality of errors assumptions of OLS regression, resulting in invalid standard errors and hypothesis tests. For a more thorough discussion of these and other problems with the linear probability model, see Long (1997, p. 38-40).` https://stats.idre.ucla.edu/r/dae/logit-regression/ – Roach May 08 '21 at 02:22
  • We can't use linear regression to check significance here; its p-values are misleading. The p-values you get from `glm` are better for checking whether there's a relationship. – Roach May 08 '21 at 02:26
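Putting the comments above together, here is a sketch of the conversion Roach describes, applied to coefficients from the `glm` output in the question (`plogis()` is base R's inverse logit, equivalent to the linked `logit2prob` helper):

```r
# plogis(x) = exp(x) / (1 + exp(x)), the inverse of the logit link
plogis(-0.5108)   # COMMUNICATION: ~0.375, i.e. 37.5% disclosed
plogis(-2.9444)   # FINANCIALS:    ~0.050, i.e. 5% disclosed
plogis(-20.5661)  # sectors with no disclosures at all: essentially 0
                  # (complete separation, hence the huge standard errors)
```

Note these match the `lm` group means in the question (0.375, 0.05, 0), which is why the two models tell the same story once the `glm` coefficients are put back on the probability scale.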