I have a question about the cv parameter of sklearn's GridSearchCV.

I'm working with data that has a time component to it, so random shuffling within KFold cross-validation doesn't seem sensible. Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To better illuminate the question, here's how I would do that manually.
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
np.random.seed(444)
index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')
# Train on the first 35 samples, validate on the next 15, test on
# the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])
param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf
# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge
print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22
The process here is:
- For both X and y, I want a training set, validation set, and testing set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10. (There's a quick sanity check of these cutoffs right after this list.)
- The train and validation sets are used to determine the optimal alpha parameter for Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
- The test set is held out for the "actual" testing as unseen data.
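Just to confirm the cutoffs and the alpha grid, this continues straight from the code above (nothing new assumed):

print(len(X_train), len(X_val), len(X_test))
# 35 15 10
print(param_grid['alpha'])
# [0.  0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1. ]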
Anyways ... it seems like I'm looking to do something like this, but I'm not sure what to pass to cv here:
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)
The docs, which I'm having trouble interpreting, specify:
cv : int, cross-validation generator or an iterable, optional
Determines the cross-validation splitting strategy. Possible inputs for cv are:
- None, to use the default 3-fold cross validation,
- integer, to specify the number of folds in a (Stratified)KFold,
- An object to be used as a cross-validation generator.
- An iterable yielding train, test splits.
For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
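If I'm reading that last bullet right ("An iterable yielding train, test splits"), maybe I can hand-build a single (train_indices, validation_indices) pair and pass that as cv. Below is my untested guess at how that would look; fitting on just the first 50 rows to keep the final 10 test samples out of the search entirely is my own assumption, not something from the docs:

import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV

# Untested guess: a one-element iterable of (train, validation) index
# arrays, so the search evaluates exactly one predefined split.
train_idx = np.arange(35)       # rows 0-34: training
val_idx = np.arange(35, 50)     # rows 35-49: validation

grid_search = GridSearchCV(Ridge(random_state=444), param_grid,
                           cv=[(train_idx, val_idx)])
# Fit on train + validation only; rows 50-59 stay held out for testing.
grid_search.fit(X.iloc[:50], y.iloc[:50])
print(grid_search.best_params_)
print(grid_search.score(X.iloc[50:], y.iloc[50:]))

One thing I'm unsure about: with the default refit=True, I believe the best estimator gets refit on all 50 rows passed to fit, whereas my manual loop trains on only the first 35, so the scores might not match exactly. I also see PredefinedSplit and TimeSeriesSplit in sklearn.model_selection, which sound related, but I'm not sure whether either fits this single-cutoff setup.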