I am working on an exercise for a statistics online course. I need to make a logistic multivariate regresssion in R using a dataset which contains 19 variables of which one is a predictor binary variable. I want to apply the bootstrap method and stepwise backward elimination to make a model that contains fewer variables and only variables that appear in at least 70% of the bootstrap models.
My code is here:
# Step 1: Multiple imputation using chained equations (MICE)
library(mice)
library(mitools)
Rubin_imp <- mice(Final_model_above_5, m = 40) # Generate 40 imputed datasets
# Combine the imputed datasets using Rubin's rules
imputed_data <- complete(Rubin_imp, "long", include = TRUE)
# Step 2: Bootstrap linear regression model with sandwich variance estimator
library(boot)
library(car)
# Set the seed for reproducibility
set.seed(12345)
# Define the formula for the regression model
formula <- eaten_within_3_months ~ .
# Create a variable occurrence counter
variable_counts <- rep(0, length(names(Final_model_above_5)))
# Define a function to perform the bootstrap regression, count variable occurrences, and track iteration number
bootstrap_regression <- function(data, indices) {
# Get the bootstrap sample
bootstrap_sample <- data[indices, ]
# Fit the regression model using backward selection
model <- stepAIC(glm(formula, data = bootstrap_sample, family = "binomial"),
direction = "backward",
scope = list(lower = ~1, upper = ~.))
# Count the occurrences of variables selected in the model
selected_variables <- names(coef(model))
variable_counts[selected_variables] <- variable_counts[selected_variables] + 1
# Return the model coefficients
return(coef(model))
}
# Perform bootstrap sampling and regression
boot_samples <- boot(imputed_data, bootstrap_regression, R = 10)
# Extract the model coefficients from the boot_samples object
coefficients_list <- boot_samples$t
# Print the results and variable occurrences
print(coefficients_list)
cat("\n")
# Step 3: Retain variables that appeared at least 70% of the time
n_boot <- 10 # Number of bootstrap samples
threshold <- n_boot * 0.7 # Minimum occurrence threshold (70% of the bootstraps)
# Get the variable names that appeared at least threshold times
selected_variables <- names(variable_counts[variable_counts >= threshold])
# Print the selected variables
cat("Selected Variables:", selected_variables, "\n")
BUT I keep getting an error saying:
Error in t.star[r, ] <- res[[r]] :
number of items to replace is not a multiple of replacement length
I am not sure what I am doing wrong but I would be grateful is anyone had a solution/ an easier way of doing this.