I am not sure if this is a question for stackoverflow or crossvalidated as it contains both an R specific coding part and a general statistics part.
In a project we want to use predictive margins, or more general, means of predicted values. The model used for prediction is a logistic regression model. The data comes from multiple surveys conducted at different times without clusters and comes with weights for known characteristics in the population. The data is weighted for each timespan.
As the data will grow over time, we don't want to do the prediction from the data used for modelling but with a dataframe containing the information of the population.
Thanks to this answer I know how to calculate the confidence intervals for every observation.
However I want to get the mean propabilty for different groups. For the propabilities I can just calculate the mean of the propabilities, but how do I get the right confidence intervals?
It seems that marginpred
and svypredmeans
from the package survey
by Thomas Lumley do some of the things I want, but don't allow to do the prediction on new data.
Here is some code and data to show the approach. Please consider that in my real use case the data I use for predictions is not the same I use for modelling.
#libraries
library(dplyr)
# Data for modelling
data(mtcars)
# Get a weight column because in my real use case the data has population weights
mtcars["weight"] <- runif(nrow(mtcars), 500, 1000)
numtofac <- c("cyl", "vs", "am", "gear", "carb")
mtcars[numtofac] <- lapply(numtofac, function(x) factor(mtcars[[x]]))
mtcars["mpg20"] <- ifelse(mtcars$mpg >=20, 1, 0)
# Get prediction data (standardizes for "disp")
# In my real use case the data for prediction is not created from my data for modelling
mtcars_vs0 <- mtcars
mtcars_vs0["vs"] <- factor(0, levels=c(0,1))
mtcars_vs1 <- mtcars
mtcars_vs1["vs"] <- factor(1, levels=c(0,1))
pred_mtcars <- rbind(mtcars_vs0, mtcars_vs1)[c("vs", "disp", "weight")]
# Logistic Regression Model
glm_mpg20 <- glm(mpg20 ~ vs*disp, family = binomial(link = "logit"), data=mtcars, weights = mtcars$weight)
# Prediction on logit scale
preds <- predict(glm_mpg20, newdata=pred_mtcars, type = "link", se.fit = TRUE)
#Calculate CIs for fitted values
critval <- 1.96 ## approx 95% CI
upr <- preds$fit + (critval * preds$se.fit)
lwr <- preds$fit - (critval * preds$se.fit)
fit <- preds$fit
# Transform from logit to propability scale
fit2 <- glm_mpg20$family$linkinv(fit)
upr2 <- glm_mpg20$family$linkinv(upr)
lwr2 <- glm_mpg20$family$linkinv(lwr)
# combine fitted values and CIs in one dataframe
fitted_mpg20 <- data.frame(propability=fit2,
lwr=lwr2,
upr=upr2)
# bind CIs and data for prediction
predicted_data <- cbind(pred_mtcars, fitted_mpg20)
# calculate mean propability for vs=0 and vs=1
mean_prop <- predicted %>%
group_by(vs) %>%
summarise(mean_propability = sum(propability*weight)/sum(weight))
Thank you very much for your help and let me know if you need something else from me