My dependent variable is a dummy variable (1 = disclosed, 0 = not disclosed) and my independent variable is categorical (five types of sectors).
With these data, can a linear regression model be used?
My objective is to identify which sectors disclose and which do not.
Is the following a reasonable approach, for example?
summary(lm(Disclosed ~ 0 + Sectors, data = df_0))
I added "0 +" to the model to remove the intercept, so that the first sector is also returned. Without it, the first sector does not appear in the output, and I don't understand why. I am very lost. Thanks!
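To illustrate what "0 +" changes, here is a minimal sketch on made-up data. The data frame `toy` and its sector labels "A", "B", "C" are hypothetical and are not my real data. With an intercept, R treats the first factor level as the reference category, so it is absorbed into `(Intercept)` and the other coefficients are differences from it. Without the intercept, each coefficient is simply that sector's proportion of disclosures.

```r
# Hypothetical toy data, made up purely for illustration
toy <- data.frame(
  Disclosed = c(1, 1, 0,  1, 0, 0,  0, 0, 0),
  Sectors   = factor(rep(c("A", "B", "C"), each = 3))
)

# With an intercept: level "A" is the reference category.
# (Intercept) is the mean of A; SectorsB and SectorsC are differences from A.
fit_ref <- lm(Disclosed ~ Sectors, data = toy)
print(coef(fit_ref))

# Without an intercept: one coefficient per sector,
# each equal to that sector's proportion of 1s.
fit_means <- lm(Disclosed ~ 0 + Sectors, data = toy)
print(coef(fit_means))

# The same proportions computed directly, for comparison
group_means <- tapply(toy$Disclosed, toy$Sectors, mean)
print(group_means)
```

Both fits describe the same model. Only the parameterization differs, which is presumably why the first sector seems to "disappear" when the intercept is kept.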
If I use a binomial logistic regression instead, I don't know how to interpret the significance values together with the signs of the estimates.
Call:
glm(formula = Disclosed ~ 0 + Sectors, family = binomial(link = "logit"),
data = df_0)
Deviance Residuals:
     Min        1Q    Median        3Q       Max
-0.96954  -0.32029  -0.00005  -0.00005   2.48638

Coefficients:
                        Estimate Std. Error z value Pr(>|z|)
SectorsCOMMUNICATION     -0.5108     0.5164  -0.989  0.32256
SectorsCONSIMERSTAPLES  -20.5661  6268.6324  -0.003  0.99738
SectorsCONSUMERDISCRET   -3.0445     1.0235  -2.975  0.00293 **
SectorsENERGY           -20.5661  3780.1276  -0.005  0.99566
SectorsFINANCIALS        -2.9444     0.7255  -4.059 4.94e-05 ***
SectorsHEALTHCARE       -20.5661  5345.9077  -0.004  0.99693
SectorsINDUSTRIALS      -20.5661  2803.4176  -0.007  0.99415
SectorsINDUSTRIALS      -20.5661 17730.3699  -0.001  0.99907
SectorsINFORMATION       -1.0986     0.8165  -1.346  0.17846
SectorsMATERIALS        -20.5661  3780.1276  -0.005  0.99566
SectorsREALESTATE       -20.5661  8865.1850  -0.002  0.99815
SectorsUTILITIES        -20.5661  7238.3932  -0.003  0.99773
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 277.259 on 200 degrees of freedom
Residual deviance: 54.185 on 188 degrees of freedom
AIC: 78.185
Number of Fisher Scoring iterations: 19
Does this mean that the financial and consumer discretionary sectors are the ones that disclose the least?
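One way to check this reading is to convert the logit estimates back to proportions with `plogis()`, the inverse logit in base R. The sketch below uses only estimates copied from the glm output above; the vector name `logit_est` is just an illustrative choice. In a model without an intercept, each coefficient is the log-odds of disclosure in that sector, so its z-test asks whether the proportion differs from 0.5, not whether the sector discloses "least".

```r
# Logit estimates copied from the glm output above (a subset of the rows)
logit_est <- c(
  SectorsCOMMUNICATION   = -0.5108,
  SectorsCONSUMERDISCRET = -3.0445,
  SectorsFINANCIALS      = -2.9444,
  SectorsINFORMATION     = -1.0986,
  SectorsENERGY          = -20.5661
)

# plogis() is the inverse logit: exp(x) / (1 + exp(x)).
# It maps a log-odds value to a probability between 0 and 1.
props <- plogis(logit_est)
print(round(props, 4))
```

The back-transformed values should come out close to 0.375, 0.045, 0.05 and 0.25 for the first four sectors and essentially 0 for the last one. Estimates near -20 with very large standard errors usually indicate a sector with no disclosures at all, where the log-odds cannot be estimated reliably.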
On the other hand, if I fit an lm, the results look more consistent. The sectors that disclose the most are information and communication: their estimates are positive and significant.
Call:
lm(formula = Disclosed ~ 0 + Sectors, data = df_0)
Residuals:
    Min      1Q  Median      3Q     Max
-0.3750 -0.0500  0.0000  0.0000  0.9546

Coefficients:
                        Estimate Std. Error t value Pr(>|t|)
SectorsCOMMUNICATION   3.750e-01  5.191e-02   7.224 1.22e-11 ***
SectorsCONSIMERSTAPLES 0.000e+00  7.341e-02   0.000 1.000000
SectorsCONSUMERDISCRET 4.545e-02  4.427e-02   1.027 0.305815
SectorsENERGY          0.000e+00  4.427e-02   0.000 1.000000
SectorsFINANCIALS      5.000e-02  3.283e-02   1.523 0.129426
SectorsHEALTHCARE      0.000e+00  6.260e-02   0.000 1.000000
SectorsINDUSTRIALS     2.194e-18  3.283e-02   0.000 1.000000
SectorsINDUSTRIALS     0.000e+00  2.076e-01   0.000 1.000000
SectorsINFORMATION     2.500e-01  7.341e-02   3.406 0.000807 ***
SectorsMATERIALS       0.000e+00  4.427e-02   0.000 1.000000
SectorsREALESTATE      0.000e+00  1.038e-01   0.000 1.000000
SectorsUTILITIES       1.416e-17  8.476e-02   0.000 1.000000
---
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.2076 on 188 degrees of freedom
Multiple R-squared: 0.2632, Adjusted R-squared: 0.2162
F-statistic: 5.597 on 12 and 188 DF, p-value: 3.568e-08