
I have a question about the cv parameter of sklearn's GridSearchCV.

I'm working with data that has a time component to it, so random shuffling within KFold cross-validation doesn't seem sensible.

Instead, I want to explicitly specify cutoffs for training, validation, and test data within a GridSearchCV. Can I do this?

To better illuminate the question, here's how I would do that manually.

import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
np.random.seed(444)

index = pd.date_range('2014', periods=60, freq='M')
X, y = make_regression(n_samples=60, n_features=3, random_state=444, noise=90.)
X = pd.DataFrame(X, index=index, columns=list('abc'))
y = pd.Series(y, index=index, name='y')

# Train on the first 35 samples, validate on the next 15, test on
#     the final 10.
X_train, X_val, X_test = np.array_split(X, [35, 50])
y_train, y_val, y_test = np.array_split(y, [35, 50])

param_grid = {'alpha': np.linspace(0, 1, 11)}
model = None
best_param_ = None
best_score_ = -np.inf

# Manual implementation
for alpha in param_grid['alpha']:
    ridge = Ridge(random_state=444, alpha=alpha).fit(X_train, y_train)
    score = ridge.score(X_val, y_val)
    if score > best_score_:
        best_score_ = score
        best_param_ = alpha
        model = ridge

print('Optimal alpha parameter: {:0.2f}'.format(best_param_))
print('Best score (on validation data): {:0.2f}'.format(best_score_))
print('Test set score: {:.2f}'.format(model.score(X_test, y_test)))
# Optimal alpha parameter: 1.00
# Best score (on validation data): 0.64
# Test set score: 0.22

The process here is:

  • For both X and y, I want a training set, validation set, and testing set. The training set is the first 35 samples in the time series. The validation set is the next 15 samples. The test set is the final 10.
  • The train and validation sets are used to determine the optimal alpha parameter within Ridge regression. Here I test alphas of (0.0, 0.1, ..., 0.9, 1.0).
  • The test set is held out for the "actual" testing as unseen data.

Anyway, it seems like I'm looking to do something like this, but I'm not sure what to pass to cv here:

from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv= ???)
grid_search.fit(...?)

The docs, which I'm having trouble interpreting, specify:

cv : int, cross-validation generator or an iterable, optional

Determines the cross-validation splitting strategy. Possible inputs for cv are:

  • None, to use the default 3-fold cross validation,
  • integer, to specify the number of folds in a (Stratified)KFold,
  • An object to be used as a cross-validation generator.
  • An iterable yielding train, test splits.

For integer/None inputs, if the estimator is a classifier and y is either binary or multiclass, StratifiedKFold is used. In all other cases, KFold is used.
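
Reading the last bullet, it sounds like I could also pass an explicit iterable of (train_indices, test_indices) pairs. Here's a minimal sketch of what I think that would look like, reusing the 35/15 split and the X, y, and param_grid defined above; is this the right direction?

from sklearn.model_selection import GridSearchCV

# One explicit "fold": train on rows 0-34, validate on rows 35-49
cv_splits = [(np.arange(0, 35), np.arange(35, 50))]

grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=cv_splits)
grid_search.fit(X.iloc[:50], y.iloc[:50])  # fit on train + validation only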

– Brad Solomon

3 Answers


As @MaxU said, it's better to let GridSearchCV handle the splits, but if you want to enforce the splitting as you have set it up in the question, then you can use PredefinedSplit, which does exactly that.

So you need to make the following changes to your code.

# Here X_test, y_test are the untouched, held-out data
# Validation data (X_val, y_val) is currently inside X_train, which will be split using PredefinedSplit inside GridSearchCV
X_train, X_test = np.array_split(X, [50])
y_train, y_test = np.array_split(y, [50])


# The indices which have the value -1 will be kept in train.
train_indices = np.full((35,), -1, dtype=int)

# The indices which have zero or positive values will be kept in test
test_indices = np.full((15,), 0, dtype=int)
test_fold = np.append(train_indices, test_indices)

print(test_fold)
# OUTPUT: 
array([-1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1, -1,
       -1,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0,  0])

from sklearn.model_selection import PredefinedSplit
ps = PredefinedSplit(test_fold)

# Check how many splits will be done, based on test_fold
ps.get_n_splits()
# OUTPUT: 1

for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)

# OUTPUT: 
('TRAIN:', array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
   17, 18, 19, 20, 21, 22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33,
   34]), 
 'TEST:', array([35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 46, 47, 48, 49]))


# And now, send this `ps` to cv param in GridSearchCV
from sklearn.model_selection import GridSearchCV
grid_search = GridSearchCV(Ridge(random_state=444), param_grid, cv=ps)

# Here, send the X_train and y_train
grid_search.fit(X_train, y_train)

The X_train, y_train sent to fit() will be split into train and test (validation in your case) using the split we defined, and hence Ridge will be trained on the original data at indices [0:35] and validated on [35:50].
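
For completeness, a hedged sketch of the steps after the fit (these are standard GridSearchCV attributes; the held-out X_test, y_test are only touched at the very end):

# Best alpha chosen on the validation fold defined by PredefinedSplit
print(grid_search.best_params_)
print(grid_search.best_score_)

# With refit=True (the default), the best estimator is refit on all of
# X_train (indices [0:50]) before scoring on the untouched test set
print(grid_search.score(X_test, y_test))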

Hope this clarifies how it works.

– Vivek Kumar

  • Just to make sure I am following -- by "the indices which have zero or positive values, will be kept in test" -- you mean what is also commonly referred to as the validation set, correct? I.e. the "test" on which the optimal parameters are determined. – Brad Solomon Jan 23 '18 at 14:42
  • @BradSolomon Yes. The test set on which the grid search will score the params. Your actual test set from index 50 and above is untouched. – Vivek Kumar Jan 23 '18 at 14:46
  • @BradSolomon You can look at my [other answer](https://stackoverflow.com/a/42230764/3374996). Look at steps 4 to 8. – Vivek Kumar Jan 23 '18 at 14:48
  • You could also do `test_fold = np.repeat([-1, 0], [35, 15])` here to save a few lines – Brad Solomon Feb 11 '18 at 08:22
  • @VivekKumar So if I understand this correctly, if one uses gridsearch on the entire dataset, there's no need for manually making another validation set? A training set and a test set should be enough? – Riley Jul 27 '18 at 08:10
  • @Riley I am not understanding the question entirely. Yes, it would not be needed to split the data to keep another validation set if the problem allows it. For example, some people have data already split into train and test and they can only use the train data for fitting. In that case, they may use the entire training data in grid search, which will split the data according to folds. They may split the data beforehand to keep a validation set away from grid search if they want to. – Vivek Kumar Jul 27 '18 at 08:16
  • @VivekKumar, I'm just a bit confused by the use of the three different sets (train, validation, and test). I'm running SVM regression and I have tuned my parameters using gridsearch on my training set (consisting of 60% of the data). As I understand it, the validation set is used to prevent overfitting, but with SVM-RBF and gridsearch, you have parameters effectively preventing that, which leads me to believe that an extra validation set isn't really necessary in this case. Apologies for the long comment, I'm starting to believe that I should have posted this as a new question. – Riley Jul 27 '18 at 08:21
  • @VivekKumar what if I want to have 3 or more sets? For example, if I have 3 sets, I can train on First and Second sets and test on the Third one. In the next iteration, train on the 2nd and 3rd and test on the 1st and so on... – Balki Nov 14 '19 at 11:50
  • @Balki Whatever you want to be in a test set, you need to index it as 0 or positive. For three sets, if you want each set to be tested once, you can assign 0, 1, 2 respectively to each set. Then when 0 and 1 are in training, 2 will be tested, and so on (see the sketch after this thread). – Vivek Kumar Nov 14 '19 at 12:35
  • @VivekKumar Thanks Vivek. Just to be clear, if I have 6 rows and I set [0,0,1,1,2,2], my cross-validation will work by first training on the first 4 rows and testing on the last 2 rows. Next time, it will train on the last 4 rows and test on the first 2 rows, and so on, correct? – Balki Nov 14 '19 at 12:47
  • @Balki Yes. But if you only need to split the rows in different folds as you show above, you can directly use [KFold](https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.KFold.html) – Vivek Kumar Nov 14 '19 at 12:57
  • @VivekKumar True, but my folds are not all of equal length, so I think PredefinedSplit will help – Balki Nov 14 '19 at 15:06
  • If I already have the data split, I would need to concatenate it again and devise the indices; is there a way to work directly with the already-split data in GridSearchCV? – user23657 May 09 '21 at 08:57
  • @user23657 No, I am afraid it cannot be done in `GridSearchCV` – Vivek Kumar May 11 '21 at 10:35
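
Following up on the thread above, a minimal sketch (a toy 6-row example, not from the original answer) of how non-negative fold labels make PredefinedSplit yield one split per label:

import numpy as np
from sklearn.model_selection import PredefinedSplit

# Label each row with the fold in which it should appear in the test set;
# three distinct labels -> three splits, each label tested exactly once
test_fold = np.array([0, 0, 1, 1, 2, 2])
ps = PredefinedSplit(test_fold)

print(ps.get_n_splits())  # 3
for train_index, test_index in ps.split():
    print("TRAIN:", train_index, "TEST:", test_index)
# TRAIN: [2 3 4 5] TEST: [0 1]
# TRAIN: [0 1 4 5] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]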

Have you tried TimeSeriesSplit?

It was made explicitly for splitting time series data.

from sklearn.model_selection import TimeSeriesSplit, GridSearchCV

tscv = TimeSeriesSplit(n_splits=3)
grid_search = GridSearchCV(clf, param_grid, cv=tscv.split(X))
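
For context (a hedged aside, not part of the original answer): TimeSeriesSplit builds expanding-window folds in which the training data always precedes the test data, for example:

import numpy as np
from sklearn.model_selection import TimeSeriesSplit

X_demo = np.arange(8).reshape(-1, 1)  # hypothetical 8-sample series
for train_index, test_index in TimeSeriesSplit(n_splits=3).split(X_demo):
    print("TRAIN:", train_index, "TEST:", test_index)
# TRAIN: [0 1] TEST: [2 3]
# TRAIN: [0 1 2 3] TEST: [4 5]
# TRAIN: [0 1 2 3 4 5] TEST: [6 7]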
– Bert Kellerman

In time series data, KFold is not the right approach: shuffling the data, or otherwise ignoring its temporal ordering when forming folds, destroys the pattern within the series. Here is an approach using TimeSeriesSplit:

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import numpy as np
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])
model = xgb.XGBRegressor()
param_search = {'max_depth': [3, 5]}

# Pass the generator of time-ordered splits directly to the cv parameter
my_cv = TimeSeriesSplit(n_splits=2).split(X)
gsearch = GridSearchCV(estimator=model, cv=my_cv,
                        param_grid=param_search)
gsearch.fit(X, y)
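
After the fit, the usual GridSearchCV attributes are available (a hedged aside, not from the original answer; the exact values depend on the data):

print(gsearch.best_params_)  # e.g. {'max_depth': 3}
print(gsearch.best_score_)   # mean validation score of the best parameters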

reference - How do I use a TimeSeriesSplit with a GridSearchCV object to tune a model in scikit-learn?

– rohan chikorde