0

I am working on an exercise for a statistics online course. I need to make a logistic multivariate regresssion in R using a dataset which contains 19 variables of which one is a predictor binary variable. I want to apply the bootstrap method and stepwise backward elimination to make a model that contains fewer variables and only variables that appear in at least 70% of the bootstrap models.

My code is here:

# Step 1: Multiple imputation using chained equations (MICE)
library(mice)
library(mitools)

Rubin_imp <- mice(Final_model_above_5, m = 40)  # Generate 40 imputed datasets

# Combine the imputed datasets using Rubin's rules
imputed_data <- complete(Rubin_imp, "long", include = TRUE)

# Step 2: Bootstrap linear regression model with sandwich variance estimator
library(boot)
library(car)

# Set the seed for reproducibility
set.seed(12345)

# Define the formula for the regression model
formula <- eaten_within_3_months ~ .

# Create a variable occurrence counter
variable_counts <- rep(0, length(names(Final_model_above_5)))

# Define a function to perform the bootstrap regression, count variable occurrences, and track iteration number
bootstrap_regression <- function(data, indices) {
  # Get the bootstrap sample
  bootstrap_sample <- data[indices, ]
  
  # Fit the regression model using backward selection
  model <- stepAIC(glm(formula, data = bootstrap_sample, family = "binomial"), 
                   direction = "backward", 
                   scope = list(lower = ~1, upper = ~.))
  
  # Count the occurrences of variables selected in the model
  selected_variables <- names(coef(model))
  variable_counts[selected_variables] <- variable_counts[selected_variables] + 1
  
  # Return the model coefficients
  return(coef(model))
}

# Perform bootstrap sampling and regression
boot_samples <- boot(imputed_data, bootstrap_regression, R = 10)

# Extract the model coefficients from the boot_samples object
coefficients_list <- boot_samples$t

# Print the results and variable occurrences
print(coefficients_list)
cat("\n")

# Step 3: Retain variables that appeared at least 70% of the time
n_boot <- 10  # Number of bootstrap samples
threshold <- n_boot * 0.7  # Minimum occurrence threshold (70% of the bootstraps)

# Get the variable names that appeared at least threshold times
selected_variables <- names(variable_counts[variable_counts >= threshold])

# Print the selected variables
cat("Selected Variables:", selected_variables, "\n")


BUT I keep getting an error saying:

Error in t.star[r, ] <- res[[r]] : 
  number of items to replace is not a multiple of replacement length

I am not sure what I am doing wrong but I would be grateful is anyone had a solution/ an easier way of doing this.

  • I don't know the exact source of your problem, but `variable_counts` is only being updated within the `bootstrap_regression` function so its value is going to be unaffected outside of that function. So it will always be zeros. – George Savva Jul 14 '23 at 11:45

1 Answers1

0

Answering this question can be a bit daunting if the problem is not reproducible. See this reference for information on that: minimal reproducible example.

Regarding the question, the error message seems similar to one discussed here, and it could have something to do with first having to expand factor levels using model.matrix().

ykkunkels
  • 26
  • 4