
TL;DR: This is probably the same as this problem, but how can I do it using sklearn? I'm okay if the plots only show the mean over the CV runs I did for each lambda (alpha).


Hi all, if I understand correctly, we need to cross-validate on the training set to select the alpha (as in sklearn) for ridge regression. In particular, I want to perform 5-fold CV repeated 5 times (so 25 validation splits) on the training set.

What I want to do is, for each alpha in alphas:

from numpy import logspace as logs
alphas = logs(-3, 3, 71)  # 71 log-spaced values from 10^-3 to 10^3

I get the MSEs on the 25 (different?) validation sets, plus the MSE on the test set once all the CVs for the training set are done, and then take the average of the 25 MSEs for plotting or reporting.

The issue is that I'm not sure how to do so. Is this the correct code to retrieve the 25 MSEs from the validation sets, which we usually can't observe?

from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score as CVS
from sklearn.model_selection import RepeatedKFold as RKF

# 5-fold CV repeated 5 times, i.e. 25 validation splits in total
cvs = RKF(n_splits=5, n_repeats=5, random_state=42)

# `al` is the current alpha from `alphas`
# the whole data set is generated with a different RNG each time;
# feel free to use any existing data set to check whether I did this wrong
# for each data set, the training set is split using the same random state
scores = CVS(Ridge(alpha=al, random_state=42), X_train, Y_train,
             scoring="neg_mean_squared_error", cv=cvs)
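
Putting it together, the full loop I have in mind is roughly this (just a sketch, assuming X_train and Y_train already exist from my train/test split; the negation turns the neg_mean_squared_error scores back into MSEs):

from numpy import logspace as logs
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score as CVS
from sklearn.model_selection import RepeatedKFold as RKF

alphas = logs(-3, 3, 71)
cvs = RKF(n_splits=5, n_repeats=5, random_state=42)

mean_mses = []
for al in alphas:
    # 25 negative MSEs for this alpha, one per validation split
    scores = CVS(Ridge(alpha=al, random_state=42), X_train, Y_train,
                 scoring="neg_mean_squared_error", cv=cvs)
    mean_mses.append(-scores.mean())  # average MSE over the 25 splits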

If not, should I use cross_validate or even RidgeCV to get the MSEs I want? Thanks in advance.

Yuki.F
1 Answer


Most likely you need to use GridSearchCV. Below is an example using 10 values of alpha:

import numpy as np
from numpy import logspace as logs
from sklearn import datasets
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV, RepeatedKFold

alphas = logs(-3, 3, 10)  # 10 log-spaced values from 10^-3 to 10^3

diabetes = datasets.load_diabetes()
X = diabetes.data[:300]
y = diabetes.target[:300]

X_val = diabetes.data[300:]
y_val = diabetes.target[300:]

We define the repeated cross-validation scheme and the grid of alphas to search over:

cvs = RepeatedKFold(n_splits=5, n_repeats=5, random_state=42)
parameters = {'alpha': alphas}

clf = GridSearchCV(Ridge(), parameters, cv=cvs)
clf.fit(X, y)

The means of the scores over the 25 splits are stored under clf.cv_results_['mean_test_score'], and the individual per-split results are in the same dictionary under keys like 'split0_test_score' through 'split24_test_score'. (If you want these scores to be MSEs rather than Ridge's default R², pass scoring="neg_mean_squared_error" to GridSearchCV and negate the values.) To plot, you can simply do:

import matplotlib.pyplot as plt

fig, ax = plt.subplots()

ax.bar(np.arange(len(alphas)), height=clf.cv_results_['mean_test_score'],
       yerr=clf.cv_results_['std_test_score'], alpha=0.5,
       error_kw=dict(ecolor='gray', lw=1, capsize=5, capthick=2))

ax.set_xticks(np.arange(len(alphas)))
ax.set_xticklabels(np.round(alphas, 3))
ax.set_xlabel('alpha')
ax.set_ylabel('mean test score')
plt.show()

[Bar chart: mean test score per alpha, with error bars]

This shows the mean and standard deviation of the score for each of the 10 values of alpha.
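
If you also want the 25 individual scores for each alpha (for example, to compute your own error bars), they are in the same clf.cv_results_ dictionary under the split keys; a minimal sketch using the clf fitted above:

# stack the per-split scores into an array of shape (n_alphas, 25)
n_splits_total = 5 * 5  # n_splits * n_repeats
split_scores = np.array([clf.cv_results_['split%d_test_score' % i]
                         for i in range(n_splits_total)]).T

# row j holds the 25 scores for alphas[j]; its mean equals mean_test_score[j]
print(split_scores.mean(axis=1))
print(clf.best_params_, clf.best_score_)  # best alpha and its mean CV score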

You can see this post on how to get the scores for a pre-defined validation set.
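
Alternatively, since X_val and y_val were already set aside above, a quick sanity check is to score the refit best model on them directly (a sketch, not a substitute for the linked approach; GridSearchCV refits on all of X, y with the best alpha because refit=True by default):

from sklearn.metrics import mean_squared_error

val_mse = mean_squared_error(y_val, clf.best_estimator_.predict(X_val))
print(clf.best_params_, val_mse)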

StupidWolf