
I'm calling for help.

I'm using regsubsets (from the leaps package) to find the combination of independent variables that maximizes the adjusted R-squared.

I set nvmax to the total number of candidate variables.

The algorithm seems to work well: I can pull out the adjusted R-squared of each model, see which one is best, and get the column names of the independent variables it selected.

I've added code that refits the variables retained by the algorithm with lm() so I can look at the regression and its coefficients.

Unfortunately, the lm() regression does not show an adjusted R-squared anywhere near the best one reported by regsubsets. In fact, the result is rather mediocre, and I can't explain the difference.

Here's my code below.

Thanks for your help

##############################################


    library(leaps)  # provides regsubsets()

    Op <- V[1:60, 2:60]
    attach(Op)
    dependent_var <- Op$Prequin.VC  # VC index, used as the dependent variable
    independent_vars <- Op[, c(11:27, 30:46, 48:50)]  # all candidate predictor columns

    # Test all combinations of independent variables
    result <- regsubsets(dependent_var ~ ., data = independent_vars,
                         nvmax = ncol(independent_vars), really.big = TRUE)

    # Store the adjusted R-squared of each model
    adjr2_values <- summary(result)$adjr2
    adjr2_values

    # Index of the model with the best adjusted R-squared
    best_model <- which.max(adjr2_values)

    # Identify the variables of the best model
    selected_vars <- colnames(independent_vars)[which(coef(result, id = best_model) != 0)]
    cat("Best variables combinations : ", paste(selected_vars, collapse = ", "), "\n")
    num_selected_variables <- length(coef(result, id = best_model)) - 1
    cat("Number of variables : ", num_selected_variables, "\n")
    cat("Best adjusted Rsquare : ", adjr2_values[best_model], "\n")

    # Fit and summarize lm() on the variables of the best model
    formula <- as.formula(paste("dependent_var ~", paste(selected_vars, collapse = " + ")))
    lm_model <- lm(formula)
    summary(lm_model)

I think the problem may stem from the way I extract the best model's predictors. When I print all the adjusted R-squared values, I can see that the model with 18 predictors has the highest value, but my code then extracts 19 column names, as if it had picked the 19-predictor model.
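One possible explanation, sketched here with made-up names: coef() on a regsubsets fit returns a named vector that holds only the retained terms plus the intercept. If so, which(... != 0) yields positions 1 to k + 1 inside that short vector, not column positions in independent_vars, which would give 19 names for an 18-predictor model. The column names and coefficient values below are hypothetical and only mimic the shape of that vector.

```r
# Hypothetical candidate columns, standing in for colnames(independent_vars)
all_columns <- c("x1", "x2", "x3", "x4", "x5")

# Hypothetical vector shaped like coef(result, id = best_model):
# intercept first, then only the retained predictors
best_coefs <- c("(Intercept)" = 1.2, x2 = 0.5, x5 = -0.3)

# which() gives positions within best_coefs (1, 2, 3),
# so this picks the first three columns regardless of what was selected
wrong <- all_columns[which(best_coefs != 0)]

# The names of the coefficient vector identify the retained predictors
right <- setdiff(names(best_coefs), "(Intercept)")

print(wrong)
print(right)
```

In this toy case the index-based version returns x1, x2, x3, while the name-based version returns x2 and x5.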

However, lm() doesn't give me the adjusted R-squared of the 19-predictor model either.
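A minimal sketch of a refit that should reproduce the regsubsets value, under two assumptions: the predictors are numeric (so coefficient names match column names), and the built-in mtcars data can stand in for V, which isn't shown here. It takes the predictor names from names(coef(...)) and passes the data to lm() explicitly instead of relying on attach().

```r
library(leaps)  # provides regsubsets()

# mtcars is a stand-in data set; mpg plays the role of the dependent variable
fit <- regsubsets(mpg ~ ., data = mtcars, nvmax = ncol(mtcars) - 1)
adjr2 <- summary(fit)$adjr2
best <- which.max(adjr2)

# coef() returns only the retained terms, by name; drop the intercept
best_coefs <- coef(fit, id = best)
selected_vars <- setdiff(names(best_coefs), "(Intercept)")

# Refit with lm() on the same data and compare the two adjusted R-squared values
refit <- lm(reformulate(selected_vars, response = "mpg"), data = mtcars)
comparison <- c(regsubsets = adjr2[best],
                lm = summary(refit)$adj.r.squared)
print(comparison)
```

If the two numbers agree on the stand-in data, the same name-based extraction is worth trying on the real data before looking for other causes.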

  • Welcome to SO, Cédric Delaveau! Questions on SO (especially in R) do much better if they are reproducible and self-contained. By that I mean including attempted code (you have a lot here, but please be explicit about non-base packages), sample representative data (perhaps via `dput(head(x))` or building data programmatically (e.g., `data.frame(...)`), possibly stochastically), perhaps actual output (with verbatim errors/warnings) versus intended output. Refs: https://stackoverflow.com/q/5963269, [mcve], and https://stackoverflow.com/tags/r/info. – r2evans Sep 03 '23 at 01:15
