I've searched the sklearn docs for TimeSeriesSplit and the cross-validation docs, but I haven't been able to find a working example.

I'm using sklearn version 0.19.

This is my setup:

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.grid_search import GridSearchCV
import numpy as np
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])
tscv = TimeSeriesSplit(n_splits=2)
for train, test in tscv.split(X):
    print(train, test)

gives:

[0 1] [2 3]
[0 1 2 3] [4 5]

If I try:

model = xgb.XGBRegressor()
param_search = {'max_depth' : [3, 5]}

my_cv = TimeSeriesSplit(n_splits=2).split(X)
gsearch = GridSearchCV(estimator=model, cv=my_cv,
                        param_grid=param_search)
gsearch.fit(X, y)

it gives:

TypeError: object of type 'generator' has no len()

I understand the problem: GridSearchCV tries to call len(cv), but my_cv is a generator, which has no length. However, the docs for GridSearchCV state that cv can be an

int, cross-validation generator or an iterable, optional

I tried using TimeSeriesSplit without the .split(X), but it still didn't work.

I'm sure I'm overlooking something simple. Thanks!

  • Try using `my_cv = [(train, test) for train, test in TimeSeriesSplit(n_splits=2).split(X)]` – Vivek Kumar Oct 13 '17 at 15:12
  • that works, thanks! But shouldn't the function work with an iterator? When the number of observations is large (worse if the number of folds is large) I'd rather not hold those big arrays in memory if possible – cd98 Oct 13 '17 at 15:18
  • Yes it should. You should post an issue on the scikit-learn github page. – Vivek Kumar Oct 13 '17 at 16:00
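
For reference, a minimal runnable version of the workaround Vivek Kumar suggests in the comments, assuming the same sklearn 0.19 setup (and deprecated sklearn.grid_search import) as the question:

import numpy as np
import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit
from sklearn.grid_search import GridSearchCV  # the deprecated import from the question

X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

# Materializing the splits into a list gives cv a len(), which the old
# sklearn.grid_search.GridSearchCV requires.
my_cv = [(train, test) for train, test in TimeSeriesSplit(n_splits=2).split(X)]

gsearch = GridSearchCV(estimator=xgb.XGBRegressor(),
                       param_grid={'max_depth': [3, 5]},
                       cv=my_cv)
gsearch.fit(X, y)

The tradeoff, as noted in the comments, is that all fold indices are held in memory at once.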

1 Answer

It turns out the problem was that I was importing GridSearchCV from sklearn.grid_search, which is deprecated. Importing GridSearchCV from sklearn.model_selection instead resolved the problem:

import xgboost as xgb
from sklearn.model_selection import TimeSeriesSplit, GridSearchCV
import numpy as np
X = np.array([[4, 5, 6, 1, 0, 2], [3.1, 3.5, 1.0, 2.1, 8.3, 1.1]]).T
y = np.array([1, 6, 7, 1, 2, 3])

model = xgb.XGBRegressor()
param_search = {'max_depth' : [3, 5]}

tscv = TimeSeriesSplit(n_splits=2)
gsearch = GridSearchCV(estimator=model, cv=tscv,
                        param_grid=param_search)
gsearch.fit(X, y)

gives:

GridSearchCV(cv=TimeSeriesSplit(max_train_size=None, n_splits=2),
       error_score='raise',
       estimator=XGBRegressor(base_score=0.5, colsample_bylevel=1, colsample_bytree=1, gamma=0,
       learning_rate=0.1, max_delta_step=0, max_depth=3,
       min_child_weight=1, missing=None, n_estimators=100, nthread=-1,
       objective='reg:linear', reg_alpha=0, reg_lambda=1,
       scale_pos_weight=1, seed=0, silent=True, subsample=1),
       fit_params=None, iid=True, n_jobs=1,
       param_grid={'max_depth': [3, 5]}, pre_dispatch='2*n_jobs',
       refit=True, return_train_score=True, scoring=None, verbose=0)
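
After fitting, the search results are available through the standard GridSearchCV attributes; a short follow-up sketch (the printed values are hypothetical and depend on your environment):

# Inspect the fitted search (values are environment-dependent)
print(gsearch.best_params_)           # e.g. {'max_depth': 3}
print(gsearch.best_score_)            # best mean cross-validated score
best_model = gsearch.best_estimator_  # refit on the full data, since refit=True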
sapo_cosmico
  • Maybe I'm doing something wrong but it seems to me that as of the current implementation the line my_cv = TimeSeriesSplit(n_splits=2).split(X) should actually be corrected to my_cv = TimeSeriesSplit(n_splits=2). Otherwise it will throw an error – Odisseo Feb 18 '19 at 20:29