I need some help.
I'm using regsubsets to find the combination of independent variables that maximizes the adjusted R-squared.
I set nvmax to the total number of candidate variables.
The algorithm seems to work well: I can extract the adjusted R-squared of each model to see which is best, as well as the column names of the independent variables it selected.
I've added code that takes the variables retained by the algorithm, fits the regression with lm(), and displays the coefficients.
Unfortunately, that regression's adjusted R-squared is nowhere near the best value reported by regsubsets. In fact, the result is rather mediocre, and I can't explain the difference.
Here's my code below.
Thanks for your help
##############################################
library(leaps) # provides regsubsets()
Op <- V[1:60, 2:60]
attach(Op)
dependent_var <- Op$Prequin.VC # select the VC index as the dependent variable
independent_vars <- Op[, c(11:27, 30:46, 48:50)] # define all the predictor columns
# Test all combinations of independent variables
result <- regsubsets(dependent_var ~ ., data = independent_vars, nvmax = ncol(independent_vars), really.big = TRUE)
# Store the adjusted R-squared of each model
adjr2_values <- summary(result)$adjr2
adjr2_values
# Store the index of the model with the best adjusted R-squared
best_model <- which.max(adjr2_values)
# Identify the variables of the best model
cat("Best variable combination: ",
    paste(colnames(independent_vars)[which(coef(result, id = best_model) != 0)], collapse = ", "), "\n")
num_selected_variables <- length(coef(result, id = best_model)) - 1
cat("Number of variables: ", num_selected_variables, "\n")
cat("Best adjusted R-squared: ", adjr2_values[best_model], "\n")
# Fit lm() on the best model and show its summary
selected_vars <- colnames(independent_vars)[which(coef(result, id = best_model) != 0)]
formula <- as.formula(paste("dependent_var ~", paste(selected_vars, collapse = " + ")))
lm_model <- lm(formula)
summary(lm_model)
I think the problem may stem from the way I extract the best model's predictors. When I display all the adjusted R-squared values, I can see that a model with 18 predictors has the highest value, but my code then extracts the column names and adjusted R-squared value of the 19-predictor model.
However, I don't get the adjusted R-squared of the 19-predictor model either when I use lm().
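One way to check this suspicion is to compare the position-based lookup used above with the names that coef() itself returns. The sketch below is only an illustration on the built-in mtcars data, not my actual data; the object names (toy, fit, by_position, by_name, refit) are made up for the example. It relies on the fact that coef() on a regsubsets object returns a named vector whose first element is "(Intercept)", so which(... != 0) gives positions inside that vector rather than column indices of the predictor table.

```r
library(leaps) # provides regsubsets()

# Toy data for illustration only (built-in mtcars), not the data from the question
toy <- mtcars[, c("mpg", "cyl", "disp", "hp", "drat", "wt", "qsec")]
predictor_names <- setdiff(colnames(toy), "mpg")

fit <- regsubsets(mpg ~ ., data = toy, nvmax = length(predictor_names))
adjr2 <- summary(fit)$adjr2
best <- which.max(adjr2)
best_coef <- coef(fit, id = best) # named vector, "(Intercept)" comes first

# Position-based lookup, as in the code above: the indices run over the
# coefficient vector (intercept included), so this simply returns the
# first few predictor columns, one more than the model actually contains
by_position <- predictor_names[which(best_coef != 0)]

# Name-based lookup: take the names from coef() and drop the intercept
by_name <- setdiff(names(best_coef), "(Intercept)")

print(by_position)
print(by_name)

# Refit with the name-based selection and compare the two adjusted R-squared values
refit <- lm(reformulate(by_name, response = "mpg"), data = toy)
print(c(regsubsets = adjr2[best], lm = summary(refit)$adj.r.squared))
```

If the two printed name vectors differ on my own data in the same way, that would explain both the extra predictor (the intercept is counted as a variable) and the mismatch in adjusted R-squared (lm() is being fitted on the wrong columns).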
...