TL;DR: Probably this problem, but how can we do it using sklearn? I'm okay if only the mean over the CVs I did for each lambda or alpha is shown in the plots.
Hi all, if I understand correctly, we need to cross-validate on the training set to select the alpha (as it is called in sklearn) for ridge regression. In particular, I want to perform a 5-fold CV repeated 5 times (so 25 train/validation splits) on the training set. What I want to do is, for each alpha from the alphas below:
from numpy import logspace as logs
alphas = logs(-3, 3, 71)  # 71 logarithmically spaced values from 10^-3 to 10^3
I get the MSEs on the 25 (different?) validation sets, plus the MSE on the test set once all the CVs for that training set are finished, and then take the average of the 25 validation MSEs for plotting or reporting.
The issue is that I'm not sure how to do this. Is the code below correct for retrieving the 25 MSEs from the validation sets, which we normally wouldn't get to observe?
from sklearn.linear_model import RidgeCV, Ridge
from sklearn.model_selection import RepeatedKFold as RKF
from sklearn.model_selection import cross_val_score as CVS

# 5-fold CV, now repeated 5 times (25 train/validation splits)
cvs = RKF(n_splits=5, n_repeats=5, random_state=42)

# the whole data set is generated with a different RNG each time;
# if you like, you may take any existing data set to explain whether I did this wrong.
# For each whole data set, the training set is split using the same random state.
# al is one alpha taken from alphas above
CVS(Ridge(alpha=al, random_state=42), X_train, Y_train,
    scoring="neg_mean_squared_error", cv=cvs)
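
For context, this is the full loop I have in mind, assuming X_train, Y_train, and alphas are defined as above (I negate the scores because sklearn's scorer returns negative MSEs); I'm not sure it is the right way to collect the 25 validation MSEs per alpha:

from sklearn.linear_model import Ridge
from sklearn.model_selection import RepeatedKFold, cross_val_score

cv = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)

mean_val_mse = []
for al in alphas:
    # 25 negative MSEs, one per train/validation split of the repeated CV
    neg_mse = cross_val_score(Ridge(alpha=al, random_state=42), X_train, Y_train,
                              scoring="neg_mean_squared_error", cv=cv)
    mean_val_mse.append(-neg_mse.mean())  # average validation MSE for this alpha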
If not, should I use cross_validate (roughly as sketched at the end of this post) or even RidgeCV to get the MSEs I want? Thanks in advance.