How to generate new rows in a dataframe in R to turn aggregate success/failure counts into individual ones/zeros (logistic regression)?

Question

I know that the dependent variable can be expressed in matrix form with cbind(). That is not what I'm asking. It is also clear to me that the results are identical. However, the ability to generate areas under the curve and similar plots is no longer straightforward with a binomial type of formatting of the dependent variable (versus a Bernoulli dependent variable of 0's and 1's).

In the following code I am looking at recent data from the US gov including number of vaccination doses per state per 100K people (d$jabsx100k) and the number of hospital beds (inpatient and ICU combined, for example) occupied by COVID patients in each state, d$TOTpos, versus those occupied by patients with any other diagnoses, d$TOTneg.

So if a line of the dataset is:

state     TOTpos     TOTneg     jabsx100k
AK        186        1019       135083

I would like to turn into

state     COVID    jabsx100k
AK        1          135083
AK        1          135083
AK        0          135083
.
.
.

just to run glm(COVID ~ jabsx100k, family='binomial').

Instead, as you can see below, what I ran is

glm(cbind(d$TOTpos,d$TOTneg) ~ scale(d$jabsx100k), family='binomial')

which works just fine, but it doesn't allow me to use the pROC package to do other things with the model after it's run.

Going in the opposite direction is easier with the aggregate() function, but I don't know how to "de-aggregate" the counts.

If there is any way of calling glm() that circumvents the need to do cbind(success, failures) it would indirectly answer the question well.

The code:

library(RCurl)
url <- "https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/COVID19%20Reported%20Patient%20Impact%20and%20Hospital%20Capacity%20by%20State%20Timeseries"
dat <- read.csv(url)
# 13: ""inpatient_beds_used"  "
# 27: "staffed_icu_adult_patients_confirmed_and_suspected_covid"
# 31:"total_adult_patients_hospitalized_confirmed_and_suspected_covid" 
# 39 ""total_staffed_adult_icu_beds"   
data <- dat[,c(1,2,27,39,31,13)]
colnames(data) <- c('state','date','ICUpos','ICUb','IPpos','IPb')
data[,2] <- as.Date(data$date)
df <- data[!(rowSums(is.na(data))),]
df$ICUneg <- df[,4]-df[,3]
df$IPneg  <- df[,6]-df[,5]
df <- df[,c('state','date','ICUpos','ICUneg','ICUb','IPpos','IPneg','IPb')]
df <- df[order(df[,'state']), ]

# Adding IP's and ICU:
df$TOTpos <- df$ICUpos + df$IPpos
df$TOTneg <- df$ICUneg + df$IPneg
df$TOTb   <- df$ICUb + df$IPb

states <- unique(df[,1])
m <- matrix(0,length(states),9)

for(i in 1:length(states)){
  temp <- df[df$state==states[i],]
  temp <- temp[order(as.Date(temp$date,format = "%d/%m/%Y")),]
  m[i,1:9] <- round(colMeans(tail(temp[,3:11],n)))
}
hosp <- cbind.data.frame(states,m)
hosp <- hosp[order(hosp[,'states']), ]
names(hosp) <- names(df)[c(1,3:11)]
hosp$Proportion <- hosp$TOTpos/(hosp$TOTneg+hosp$TOTpos)
head(hosp)
y = url("https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/covid19_vaccinations_in_the_united_states.csv")
vaccines <- read.csv(y)
vac <- vaccines[,c(1,3)]
# 3: Doses Delivered per 100K
names(vac) <- c('state','jabsx100k')

s = url("https://raw.githubusercontent.com/RInterested/DATASETS/gh-pages/states.csv")
abbr <- read.csv(s)

vac <- vac[vac[,1] %in% abbr[,1],]

for(i in 1:nrow(vac)){
    vac[i,1] <- abbr[which(abbr[,1]==vac[i,1]),2]
}

vac[,2] <- as.numeric(vac[,2])

d <- merge(hosp,vac, by="state")
head(d)
fit2 <- glm(cbind(d$TOTpos,d$TOTneg) ~ scale(d$jabsx100k), family='binomial')
summary(fit2)
exp(cbind(OR = coef(fit2), confint(fit2)))

Please note apropos of the answer that I did already scrutinize the site for ideas, including the option for weights in the glm() function. In particular, fit3 <- glm(d$Proportion ~ scale(d$jabsx100k), weights =d$TOTb , family="binomial"), where d$TOTb is the total number of beds.

This is not an improvement in any way, since I'm still stuck in the same issue of not having a Bernoulli format of the independent variable to allow me to do AUC plots and other stuff like that after the regression.

Also trying to pre-empt... I did try negative binomial and Poisson regression.

Side note: if you want to use `require` then check it's return value; if not, your code is fragile since if the package is not available, no error is generated. (Or use `library` instead, it will error with a clear message if the package is not available.) See https://stackoverflow.com/a/51263513/3358272 — r2evans, Sep 13 '21 at 15:11

Johannes · Answer 1 · 2021-09-13T15:36:08.797

0

Not sure, but maybe it's the weights parameter of the function glm() you're looking for. (This refers to your question: "If there is any way of calling glm() that circumvents the need to do cbind(success, failures) it would indirectly answer the question well.")

The help file of glm says:

"For a binomial GLM prior weights are used to give the number of trials when the response is the proportion of successes"

So if you've only the proportion of 'successes' than the weights vector is the number of trials.

See also antother discussion on this here: https://stats.stackexchange.com/questions/141412/r-glm-function-with-family-binomial-and-weight-specification

edited Sep 13 '21 at 15:36

answered Sep 13 '21 at 15:28

Johannes

1,024
13
32

I should have included that... I did do it. Here: `fit3 <- glm(d$Proportion ~ scale(d$jabsx100k), weights =d$TOTb , family="binomial")` – Antoni Parellada Sep 13 '21 at 15:29
This may be a partial solution because it forces me to carry this additional caveat across any subsequent analysis. – Antoni Parellada Sep 13 '21 at 15:30
I honestly think that this answer would be best reconfigured into a comment, although I edited the OP to reflect that I had tried that, and didn't offer a solution to the actual problem. – Antoni Parellada Sep 13 '21 at 16:32

score 0 · Answer 2 · answered Sep 13 '21 at 18:10

I can't find any built-in functions to do exactly what I need, but it is not difficult to do it ad hoc:

longform <- matrix(0, sum(d$TOTpos) + sum (d$TOTneg), 2)
for (i in 1: sum (d$TOTpos)) longform[i,1] <- 1
pos <- rep(d$jabsx100k[1], d$TOTpos[1])
neg <- rep(d$jabsx100k[1], d$TOTneg[1])
for (i in 2:nrow(d)) pos <- c(pos, rep(d$jabsx100k[i], d$TOTpos[i]))
for (i in 2:nrow(d)) neg <- c(neg, rep(d$jabsx100k[i], d$TOTneg[i]))
longform[,2] <- c(pos,neg)
longform <- (as.data.frame(longform))
names(longform) <- c("COVIDpos", "Vaccines")
head(longform)

fitlong <- glm(longform$COVIDpos ~ scale(longform$Vaccines), family='binomial')
summary(fitlong)
exp(cbind(OR = coef(fitlong), confint(fitlong)))

How to generate new rows in a dataframe in R to turn aggregate success/failure counts into individual ones/zeros (logistic regression)?

2 Answers2