Compute cross validation for multi-variate linear regression

Question

I am training several models for a regression problem. To choose the best one, I want to run k-fold cross-validation with k = 20, characterize the MSE of each model, and determine statistically which model is better. The problem has multiple dependent variables, and I would like to compute the MSE separately for each dependent variable, but cross_val_score doesn't let me do that explicitly. Here is example code for one of my models:

from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import MinMaxScaler
from sklearn.model_selection import cross_val_score
from sklearn.model_selection import KFold
from sklearn.metrics import mean_squared_error

# x, y, x_test, y_test and scaler2 (the scaler applied to the targets) are defined earlier
model = LinearRegression()

model.fit(x, y)

y_pred = model.predict(x_test)

mse = mean_squared_error(scaler2.inverse_transform(y_test), scaler2.inverse_transform(y_pred), multioutput="raw_values")

How can I repeat the training k times, once for each of the k models trained and tested in a k-fold cross-validation? Scikit-learn provides KFold, but it is just a way to specify the number of folds and it doesn't actually return the training and test folds, so I can't think of a way to train the different models following k-fold cross-validation. Also, I need to evaluate the MSE separately on each dependent variable, since the problem has multiple dependent variables.

"*Scikit provides a Kfold but [...] it doesnt actually returns the training and test folds,*" - actually it does exactly that: https://stackoverflow.com/questions/54201464/cross-validation-metrics-in-scikit-learn-for-each-data-split/54202609#54202609 — desertnaut, Mar 10 '22 at 19:10

score 0 · Accepted Answer · answered Mar 10 '22 at 19:44

You can use scikit-learn's KFold with a simple for loop: kf.split(X) yields the training and test indices for each fold.
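As a quick check of what the splitter returns, here is a minimal sketch on toy data. The array X below is made up for illustration and is not from the question.

```python
import numpy as np
from sklearn.model_selection import KFold

# Toy data (assumption, for illustration only): 6 samples, 2 features
X = np.arange(12).reshape(6, 2)

kf = KFold(n_splits=3)

# split() yields one (train_index, test_index) pair of integer arrays per fold
folds = list(kf.split(X))
for fold, (train_index, test_index) in enumerate(folds):
    print(fold, train_index, test_index)
```

Each pair can be used to index the feature and target arrays directly, which is all the loop needs.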

Here is an example that runs 5-fold cross-validation on a Bayes classifier:

import numpy as np
from sklearn.model_selection import KFold

# X_train_concat, y_train_concat, trainBayes and predict are defined elsewhere
k = 5
kf = KFold(n_splits=k)

res = []
for train_index, test_index in kf.split(X_train_concat):
    X_train_kf, X_test_kf = X_train_concat[train_index, :], X_train_concat[test_index, :]
    y_train_kf, y_test_kf = y_train_concat[train_index], y_train_concat[test_index]

    # Train on this fold's training split only, with the labels appended as the last column
    X_train = np.append(X_train_kf, np.reshape(y_train_kf, (len(y_train_kf), 1)), axis=1)
    W_bayes = trainBayes(X_train)
    y_pred = predict(X_test_kf, W_bayes)

    # Misclassification rate of this fold, in percent
    mis_classification = len(y_pred) - np.count_nonzero(y_pred == y_test_kf)
    e = (mis_classification / y_test_kf.shape[0]) * 100

    res.append(e)

avg_res = sum(res) / k
print('Result of each fold - {}'.format(res))
print('Avg result : {}'.format(avg_res))

For more check this
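The same loop can be adapted to the multi-output regression in the question. The sketch below is only an illustration: the synthetic X and y (200 samples, 3 features, 2 targets) stand in for the real data, and shuffle=True with a fixed random_state is an assumption, not something the question specifies. It fits a fresh LinearRegression on each fold and keeps one MSE per dependent variable by passing multioutput="raw_values".

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import KFold

# Synthetic stand-in for the real data (assumption): 3 features, 2 targets
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
W = np.array([[1.0, -2.0], [0.5, 0.0], [0.0, 3.0]])
y = X @ W + rng.normal(scale=0.1, size=(200, 2))

k = 20
kf = KFold(n_splits=k, shuffle=True, random_state=0)

fold_mse = []
for train_index, test_index in kf.split(X):
    model = LinearRegression()  # fresh model for every fold
    model.fit(X[train_index], y[train_index])
    y_pred = model.predict(X[test_index])
    # raw_values returns one MSE per target column instead of their average
    fold_mse.append(mean_squared_error(y[test_index], y_pred, multioutput="raw_values"))

fold_mse = np.array(fold_mse)  # shape (k, n_targets)
print("Mean MSE per target:", fold_mse.mean(axis=0))
print("Std of MSE per target:", fold_mse.std(axis=0))
```

The resulting (k, n_targets) array gives k MSE values per dependent variable, which can then be compared across models. If the targets were scaled, as with scaler2 in the question, the predictions and true values would need to be inverse-transformed before computing the MSE, and the scaler would ideally be fitted on each fold's training split only.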
