
I want to do a time series cross-validation based on group (the grp column). In the sample data below, temperature is my target variable.

    import numpy as np
    import pandas as pd

    timeS = pd.date_range(start='1980-01-01 00:00:00', end='1980-01-01 00:00:05',
                          freq='S')
    df = pd.DataFrame(dict(time=timeS, grp=['A']*3 + ['B']*3, material=[1, 2, 3]*2,
                           temperature=['2.4', '5', '9.9']*2))


      grp  material temperature                time
    0   A         1         2.4 1980-01-01 00:00:00
    1   A         2           5 1980-01-01 00:00:01
    2   A         3         9.9 1980-01-01 00:00:02
    3   B         1         2.4 1980-01-01 00:00:03
    4   B         2           5 1980-01-01 00:00:04
    5   B         3         9.9 1980-01-01 00:00:05

I am planning to add some lag features based on grp using this code (a sketch of attaching the result as a column follows the output below).

    df.groupby("grp")['temperature'].shift(-1)

    0      5
    1    9.9
    2    NaN
    3      5
    4    9.9
    5    NaN
    Name: temperature, dtype: object
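
A minimal sketch, assuming the same df as above, of attaching that shifted series back onto the frame as a feature column (the column name lag_1 is just illustrative; temperature is cast to numeric first since it was built from strings):

    # cast string temperatures to floats, then attach the per-group shift as a new column
    df['temperature'] = pd.to_numeric(df['temperature'])
    df['lag_1'] = df.groupby('grp')['temperature'].shift(-1)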

The problem I have now is with cross-validation: I can use sklearn.model_selection.TimeSeriesSplit, but it does not take the group effect into consideration. Can anyone tell me how to do the CV split per group (like a stratified split)? I am going to use xgboost.cv for CV, if that helps.

Edit: The time range differs per group; time increases uniformly (one second per row) within each group.

  • Did you find a solution for this? I am looking for an approach myself. – SriK Apr 12 '18 at 07:30
  • Not really. I manually looped through each group after sorting it by time and sampled the last few rows for validation. It was a really time-consuming process. – XXavier May 03 '18 at 21:50
  • Ah, so you drop data in the splits to get an adequate proportionality? Yeah, that would work if you have lots of data. – SriK May 04 '18 at 16:02
  • Looks like a similar question was asked after this one, which got some answers: https://stackoverflow.com/q/51963713/7619676 – ZaxR May 03 '21 at 17:51
  • xgboost.cv takes an argument folds which seems to allow the flexibility you need: folds (a KFold or StratifiedKFold instance or list of fold indices) – Sklearn KFolds or StratifiedKFolds object. *Alternatively may explicitly pass sample indices for each fold. For n folds, folds should be a length n list of tuples. Each tuple is (in, out) where in is a list of indices to be used as the training samples for the n-th fold and out is a list of indices to be used as the testing samples for the n-th fold.* https://xgboost.readthedocs.io/en/stable/python/python_api.html#xgboost.cv (a sketch combining this with the per-group split described above follows these comments) – Sean Pohorence Apr 30 '23 at 18:24
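
Putting the last comment together with the group-wise split XXavier describes above, a hedged sketch of building per-group, time-ordered fold indices and handing them to xgboost.cv via folds could look like this (the helper name group_time_folds, the feature columns, the parameters, and n_test are illustrative placeholders, not from the original post):

    import numpy as np
    import pandas as pd
    import xgboost as xgb

    def group_time_folds(df, group_col='grp', time_col='time', n_test=1):
        """For each group, sort by time and hold out the last n_test rows as
        the validation set. Returns one (train, test) tuple of positional row
        indices per group, as expected by xgboost.cv's folds argument."""
        positions = pd.Series(np.arange(len(df)), index=df.index)
        folds = []
        for _, g in df.groupby(group_col):
            pos = positions[g.sort_values(time_col).index].to_numpy()
            folds.append((pos[:-n_test].tolist(), pos[-n_test:].tolist()))
        return folds

    folds = group_time_folds(df)

    # illustrative call; real feature engineering and parameters would go here
    dtrain = xgb.DMatrix(df[['material']].astype(float),
                         label=pd.to_numeric(df['temperature']))
    results = xgb.cv({'objective': 'reg:squarederror'}, dtrain,
                     num_boost_round=10, folds=folds)

This yields one fold per group, each holding out the last n_test time-ordered rows of that group for validation; the tiny sample data here is only to show the mechanics.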

1 Answer


The following should do it:

    import pandas as pd

    # Walk-forward (expanding window) validation: the test set is always the
    # single observation immediately after the growing training window.
    series = pd.read_csv('yourfile.csv', header=0, index_col=0).squeeze('columns')
    X = series.values
    n_train = 500
    n_records = len(X)
    for i in range(n_train, n_records):
        train, test = X[0:i], X[i:i + 1]
        print('train=%d, test=%d' % (len(train), len(test)))
  • How does this account for stratification? This solution does not even look at the group partitions. – SriK Apr 12 '18 at 07:28