I know that the dependent variable can be expressed in matrix form with cbind()
. That is not what I'm asking. It is also clear to me that the results are identical. However, the ability to generate areas under the curve and similar plots is no longer straightforward with a binomial type of formatting of the dependent variable (versus a Bernoulli dependent variable of 0's and 1's).
In the following code I am looking at recent data from the US gov including number of vaccination doses per state per 100K people (d$jabsx100k
) and the number of hospital beds (inpatient and ICU combined, for example) occupied by COVID patients in each state, d$TOTpos
, versus those occupied by patients with any other diagnoses, d$TOTneg
.
So if a line of the dataset is:
state TOTpos TOTneg jabsx100k
AK 186 1019 135083
I would like to turn into
state COVID jabsx100k
AK 1 135083
AK 1 135083
AK 0 135083
.
.
.
just to run glm(COVID ~ jabsx100k, family='binomial')
.
Instead, as you can see below, what I ran is
glm(cbind(d$TOTpos,d$TOTneg) ~ scale(d$jabsx100k), family='binomial')
which works just fine, but it doesn't allow me to use the pROC
package to do other things with the model after it's run.
Going in the opposite direction is easier with the aggregate()
function, but I don't know how to "de-aggregate" the counts.
If there is any way of calling glm()
that circumvents the need to do cbind(success, failures)
it would indirectly answer the question well.
The code:
library(RCurl)
url <- "https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/COVID19%20Reported%20Patient%20Impact%20and%20Hospital%20Capacity%20by%20State%20Timeseries"
dat <- read.csv(url)
# 13: ""inpatient_beds_used" "
# 27: "staffed_icu_adult_patients_confirmed_and_suspected_covid"
# 31:"total_adult_patients_hospitalized_confirmed_and_suspected_covid"
# 39 ""total_staffed_adult_icu_beds"
data <- dat[,c(1,2,27,39,31,13)]
colnames(data) <- c('state','date','ICUpos','ICUb','IPpos','IPb')
data[,2] <- as.Date(data$date)
df <- data[!(rowSums(is.na(data))),]
df$ICUneg <- df[,4]-df[,3]
df$IPneg <- df[,6]-df[,5]
df <- df[,c('state','date','ICUpos','ICUneg','ICUb','IPpos','IPneg','IPb')]
df <- df[order(df[,'state']), ]
# Adding IP's and ICU:
df$TOTpos <- df$ICUpos + df$IPpos
df$TOTneg <- df$ICUneg + df$IPneg
df$TOTb <- df$ICUb + df$IPb
states <- unique(df[,1])
m <- matrix(0,length(states),9)
for(i in 1:length(states)){
temp <- df[df$state==states[i],]
temp <- temp[order(as.Date(temp$date,format = "%d/%m/%Y")),]
m[i,1:9] <- round(colMeans(tail(temp[,3:11],n)))
}
hosp <- cbind.data.frame(states,m)
hosp <- hosp[order(hosp[,'states']), ]
names(hosp) <- names(df)[c(1,3:11)]
hosp$Proportion <- hosp$TOTpos/(hosp$TOTneg+hosp$TOTpos)
head(hosp)
y = url("https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/covid19_vaccinations_in_the_united_states.csv")
vaccines <- read.csv(y)
vac <- vaccines[,c(1,3)]
# 3: Doses Delivered per 100K
names(vac) <- c('state','jabsx100k')
s = url("https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/states.csv")
abbr <- read.csv(s)
vac <- vac[vac[,1] %in% abbr[,1],]
for(i in 1:nrow(vac)){
vac[i,1] <- abbr[which(abbr[,1]==vac[i,1]),2]
}
vac[,2] <- as.numeric(vac[,2])
d <- merge(hosp,vac, by="state")
head(d)
fit2 <- glm(cbind(d$TOTpos,d$TOTneg) ~ scale(d$jabsx100k), family='binomial')
summary(fit2)
exp(cbind(OR = coef(fit2), confint(fit2)))
Please note apropos of the answer that I did already scrutinize the site for ideas, including the option for weights
in the glm()
function. In particular, fit3 <- glm(d$Proportion ~ scale(d$jabsx100k), weights =d$TOTb , family="binomial")
, where d$TOTb
is the total number of beds.
This is not an improvement in any way, since I'm still stuck in the same issue of not having a Bernoulli format of the independent variable to allow me to do AUC plots and other stuff like that after the regression.
Also trying to pre-empt... I did try negative binomial and Poisson regression.