
I'm trying to apply the KFold method, but I don't know how to access the training and test sets it generates. After going through several blogs and the scikit-learn user guide, the only thing people do is print the training and test sets. That can work for a small dataframe, but it's not useful for larger ones. Can anyone help me?

The data I'm using: https://github.com/ageron/handson-ml/tree/master/datasets/housing

Where I'm currently at:

from sklearn.model_selection import KFold

X = housing[['total_rooms', 'total_bedrooms']]
y = housing['median_house_value']

kf = KFold(n_splits=5)

for train_index, test_index in kf.split(X):
    X_train, X_test = X.iloc[train_index], X.iloc[test_index]
    y_train, y_test = y.iloc[train_index], y.iloc[test_index]

But this only gives me the splits from the last iteration. I'd like to be able to get all of them.

Thanks in advance.

dekio
    What do you mean "*access the datasets*"? Access them for what? – desertnaut Mar 28 '20 at 16:17
  • @desertnaut well, it could mean anything. The most obvious is to run the model using all of them. I guess I could do this inside the loop. But I wanted to be able to have e.g.: X_train_1, X_test_1, X_train_2, X_test_2, and so on... – dekio Mar 28 '20 at 16:22
  • @desertnaut basically to see all the combinations if I wanted. Am I missing something? I mean, regarding the theory – dekio Mar 28 '20 at 16:22

1 Answer


AFAIK, KFold (and in fact everything related to the cross-validation process) is meant to provide temporary datasets, so that one can, as you say, use them on the fly for fitting & evaluating models, as shown in Cross-validation metrics in scikit-learn for each data split.
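For that on-the-fly use, scikit-learn's cross_val_score runs the fit/evaluate loop for you. A minimal sketch, using the built-in diabetes dataset as a stand-in (since the housing CSV isn't loaded here):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import KFold, cross_val_score

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=5, shuffle=True, random_state=0)

# Fits and scores a fresh model on each of the 5 temporary splits
scores = cross_val_score(LinearRegression(), X, y, cv=kf)
print(len(scores))  # one score per fold
```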

Nevertheless, since KFold.split() returns a Python generator, you can use the indices it yields to build permanent subsets, albeit with some manual work. Here is an example with the diabetes data (load_boston has since been removed from scikit-learn, so the diabetes dataset is used instead):

from sklearn.model_selection import KFold
from sklearn.datasets import load_diabetes

X, y = load_diabetes(return_X_y=True)
n_splits = 3
kf = KFold(n_splits=n_splits, shuffle=True, random_state=0)

# list() materialises the generator, giving all n_splits (train, test) index pairs;
# calling next(kf.split(X)) repeatedly would restart the generator each time
folds = list(kf.split(X))

Now, for every k in range(n_splits), folds[k][0] contains the training indices and folds[k][1] the corresponding validation indices, so you can do:

X_train_1 = X[folds[0][0]]
X_test_1 = X[folds[0][1]]

and so on. Notice that the same indices are applicable to the labels y too.
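If you want all the fold-wise subsets at once, rather than numbered variables like X_train_1, X_train_2, and so on, one option is to collect them in a dictionary keyed by fold number. A sketch along the same lines, again with the diabetes data:

```python
from sklearn.datasets import load_diabetes
from sklearn.model_selection import KFold

X, y = load_diabetes(return_X_y=True)
kf = KFold(n_splits=3, shuffle=True, random_state=0)

# One entry per fold, holding the four permanent subsets
splits = {}
for k, (train_idx, test_idx) in enumerate(kf.split(X)):
    splits[k] = {
        "X_train": X[train_idx], "X_test": X[test_idx],
        "y_train": y[train_idx], "y_test": y[test_idx],
    }

# e.g. the training features of the second fold:
X_train_2 = splits[1]["X_train"]
```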

desertnaut