I'm working with panel data comprising several years of observations of schools. My DV is a proportion of exam passers but is not normally distributed, and many observations of the DV are > 0.8. A panel linear model using plm()
(from package plm
) is therefore inappropriate, so I am trying to treat the DV as a binary response and use logistic regression with pglm()
(from package pglm
). I have counts of the numbers of test takers and passers.
I have determined that I need to use fixed effects (within-unit) estimation for these data as I'm interested in the average change in exam pass rates within schools. I have far too many observations to post the full dataset but here is a small reproducible example of the error message:
id <- c(1, 1, 1, 2, 2, 2, 3, 3, 3, 4, 4, 4)
year <- rep(c(2017, 2018, 2019), 4)
proportion <- c(.67, .77, .79, .88, .89, .85, .79, .81, .79, .87, .75, .74)
X1 <- c(.05, .041, .037, .015, .012, .021, .081, .055, .062, .034, .031, .022)
X2 <- c(145, 146, 145, 155, 154, 154, 150, 152, 156, 148, 150, 151)
takers <- c(50, 62, 55, 112, 101, 119, 44, 45, 48, 66, 69, 60)
passers <- c(34, 48, 43, 99, 90, 101, 35, 36, 38, 57, 52, 44)
fails <- takers - passers
data <- as.data.frame(cbind(id, year, proportion, X1, X2, takers, passers, fails))
pglm::pglm(cbind(passers, fails) ~ X1 + X2, index = c("id", "year"), model = "within", family = binomial(link = "logit"), data = data)
#> Error in `.rowNamesDF<-`(x, value = value): duplicate 'row.names' are not allowed
Created on 2020-10-21 by the reprex package (v0.3.0)
I do not encounter an issue running a regular logit:
glm(cbind(passers, fails) ~ X1 + X2,family = binomial(link = "logit"), data = data)
And I am also familiar with an alternative to the treat-DV-as-binary approach, namely the betareg() package which uses beta regression]2, but I do not see a why to use fixed effects with betareg(). I can also run this code using glmer() and setting a random intercept (1|id), but a random effects approach does not make theoretical sense given my research question and a Hausman test indicates I need fixed effects anyway.
My interpretation of the error message is that row names are being duplicated somehow; I ensured this was not the case by setting all row names to NULL but this did not fix the issue:
row.names(data) <- NULL
I also referred to seemingly similar questions on this issue such as this but I have ensured that there are no duplications in the id-year pairing.
So, any help on figuring out what the error is would be much appreciated. Comments on methodology are also welcome, of course.