11

I'm trying to solve a machine learning problem on a dataset with a time-series component. I'm using the well-known Python library sklearn, which provides many cross-validation iterators as well as several ways to define your own. The problem is that I don't know how to define a simple cross-validation scheme for time series. Here is an example of what I'm trying to get:

Suppose we have several periods (years) and we want to split our data set into several chunks as follows:

data = [1, 2, 3, 4, 5, 6, 7]

train: [1]                test: [2] (or test: [2, 3, 4, 5, 6, 7])
train: [1, 2]             test: [3] (or test: [3, 4, 5, 6, 7])
train: [1, 2, 3]          test: [4] (or test: [4, 5, 6, 7])
...
train: [1, 2, 3, 4, 5, 6] test: [7]

I can't work out how to create this kind of cross-validation with sklearn's tools. Perhaps I should use PredefinedSplit from sklearn.cross_validation, like this:

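from sklearn import cross_validation  # moved to sklearn.model_selection in scikit-learn 0.18+

train_fraction  = 0.8
train_size      = int(train_fraction * X_train.shape[0])
validation_size = X_train.shape[0] - train_size

# -1 keeps a sample in the training set; 1 assigns it to the single test fold.
cv_split = cross_validation.PredefinedSplit(test_fold=[-1] * train_size + [1] * validation_size)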

Result:

train: [1, 2, 3, 4, 5] test: [6, 7]

But this is still not as good as the data split described above.
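
A minimal sketch of what that `test_fold` produces for the seven-element example. It assumes a recent scikit-learn, where `PredefinedSplit` is imported from `sklearn.model_selection`; the hard-coded sizes 5 and 2 stand in for `train_size` and `validation_size`.

```python
import numpy as np
from sklearn.model_selection import PredefinedSplit  # assumed: scikit-learn >= 0.18

data = np.array([1, 2, 3, 4, 5, 6, 7])

# -1 = never used for testing; 1 = member of the single test fold.
test_fold = [-1] * 5 + [1] * 2
cv_split = PredefinedSplit(test_fold=test_fold)

# split() yields index arrays; map them back onto the data for display.
splits = [(data[train], data[test]) for train, test in cv_split.split()]
for train, test in splits:
    print("train:", train.tolist(), "test:", test.tolist())
```

Each sample can belong to at most one test fold, and every sample outside that fold is used for training, so this construction gives a single split here and does not seem able to express a training window that grows from one split to the next.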

Demyanov
  • what are the variables in your data set? why is it important to use the time series to split, why not just split randomly? – maxymoo Nov 25 '15 at 23:31
  • 1
    You could generate the splits without the use of scikit-learn, as follows: `cv_split = [(data[:i], data[i:]) for i in range(1, len(data))]`. What do you think? – Dan Oneață Nov 25 '15 at 23:33
  • @maxymoo, The reason not to split randomly with time series data is that time might matter (not just the other features you've identified) but "in the wild" you never get to train your model on data from the future. So in testing your model, you should behave similarly and not train on data from after the test date(s). – dslack Nov 25 '15 at 23:34
  • @DanOneață I'm sorry I didn't mention this in the question, but after creating the `PredefinedSplit` I pass it to `RFECV`, which requires a cross-validation generator or an iterable yielding train/test splits. So I thought maybe I could solve the problem with sklearn tools – Demyanov Nov 25 '15 at 23:46
  • 1
    @Demyanov But `cv_split` as I've defined it above is an iterable yielding a train/test split, if we consider `data` to be the indices of the data. – Dan Oneață Nov 26 '15 at 00:07

2 Answers

7

You can obtain the desired cross-validation splits without any of sklearn's cross-validation iterators, because `RFECV` accepts any iterable of (train, test) index pairs. Here's an example:

import numpy as np

from sklearn.svm import SVR
from sklearn.feature_selection import RFECV

# Generate some data.
N = 10
X_train = np.random.randn(N, 3)
y_train = np.random.randn(N)

# Define the splits.
idxs = np.arange(N)
cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N)]

# Create the RFE object and compute a cross-validated score.
svr = SVR(kernel="linear")
rfecv = RFECV(estimator=svr, step=1, cv=cv_splits)
rfecv.fit(X_train, y_train)
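
The same idea can be wrapped in a small generator. This is only a sketch: `expanding_window_splits` and its `step` and `single_step` parameters are hypothetical names, not part of sklearn. With `single_step=True` each test set holds only the next observation; otherwise it holds everything after the training window.

```python
import numpy as np


def expanding_window_splits(n_samples, step=1, single_step=False):
    """Yield (train_idx, test_idx) pairs with a growing training window.

    Hypothetical helper for illustration, not an sklearn API.
    """
    idxs = np.arange(n_samples)
    for i in range(1, n_samples, step):
        # Test on the next observation only, or on everything that follows.
        test_end = i + 1 if single_step else n_samples
        yield idxs[:i], idxs[i:test_end]


# The seven-period example from the question, as indices 0..6.
for train, test in expanding_window_splits(7, single_step=True):
    print("train:", train.tolist(), "test:", test.tolist())
```

Since a generator can only be consumed once, passing `cv=list(expanding_window_splits(N))` to `RFECV` is the safer choice.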
Dan Oneață
  • Won't this create a separate split for each observation while windowing forward? If I want to decrease this should I use the `step` parameter in range to make it go up in larger 'chunks'? – dreyco676 Aug 15 '16 at 04:22
  • 1
    @dreyco676 That's right. Just use a `step` parameter greater than one, for example, `cv_splits = [(idxs[:i], idxs[i:]) for i in range(1, N, 2)]` – Dan Oneață Aug 15 '16 at 14:12
  • Just to be sure: that StratifiedKFold was left behind, right? – paulochf Sep 09 '16 at 19:35
  • 1
    @paulochf Are you referring to the import of `StratifiedKFold`? You are right, that is unused—I'm going to remove it from the code snippet. – Dan Oneață Sep 11 '16 at 14:53
4

Since this question was asked, `TimeSeriesSplit` has been added to the library: http://scikit-learn.org/stable/modules/cross_validation.html#time-series-split

Example from the doc:

>>> import numpy as np
>>> from sklearn.model_selection import TimeSeriesSplit

>>> X = np.array([[1, 2], [3, 4], [1, 2], [3, 4], [1, 2], [3, 4]])
>>> y = np.array([1, 2, 3, 4, 5, 6])
>>> tscv = TimeSeriesSplit(n_splits=3)
>>> print(tscv)  
TimeSeriesSplit(n_splits=3)
>>> for train, test in tscv.split(X):
...     print("%s %s" % (train, test))
[0 1 2] [3]
[0 1 2 3] [4]
[0 1 2 3 4] [5]
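
Since `TimeSeriesSplit` is a regular cross-validation generator, it should be usable directly as the `cv` argument of `RFECV`, which is what the question needs. A minimal sketch, assuming a scikit-learn version that ships `sklearn.model_selection`; the data is a random placeholder, and `N = 20` and `n_splits=3` are arbitrary choices.

```python
import numpy as np
from sklearn.svm import SVR
from sklearn.feature_selection import RFECV
from sklearn.model_selection import TimeSeriesSplit

# Placeholder data; rows are assumed to be in chronological order.
rng = np.random.RandomState(0)
N = 20
X_train = rng.randn(N, 3)
y_train = rng.randn(N)

# Each split trains on an initial segment and tests on the block right after it.
tscv = TimeSeriesSplit(n_splits=3)
rfecv = RFECV(estimator=SVR(kernel="linear"), step=1, cv=tscv)
rfecv.fit(X_train, y_train)

print(rfecv.support_)  # boolean mask of the selected features
```

Unlike the hand-built list of splits, the test sets here have a fixed size and never extend to the end of the data, so the two approaches are not interchangeable.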
Marcus V.