
I currently have a CSV with many rows (around 200k) and many columns. I want a time-series train/test split: the dataset contains many unique items, and for each item the first 80% of its rows (chronologically) should go into the training data. I wrote the following code to do so:

import pandas as pd 
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date',ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)

It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I adding them in wrong?

    [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Sep 02 '20 at 02:19
  • @Parfait Ok so to avoid that I think I can have trainvals and testvals declared before the for loop, then just append them to train and test after the loop? – Sathvik Chinta Sep 02 '20 at 02:21
  • Are the time series concurrent for each item? i.e. all items have values from 9am to 5:30pm.. if so, simply sorting on time could allow for a single split – RichieV Sep 02 '20 at 02:48
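
The pattern Parfait's comment recommends — accumulate the pieces in a plain Python list, then concatenate once at the end — can be sketched like this (the frame below is made-up toy data, not the asker's CSV):

```python
import pandas as pd

# Toy frame standing in for the real CSV (hypothetical data)
df = pd.DataFrame({'itemid': [1, 1, 2, 2], 'val': [10, 20, 30, 40]})

pieces = []                     # plain Python list
for _, grp in df.groupby('itemid'):
    pieces.append(grp)          # list.append is cheap; no frame copying
result = pd.concat(pieces)      # one concatenation at the end
```

Each `DataFrame.append` / `pd.concat` call copies every row accumulated so far, so calling it once per iteration is O(n²) in total; collecting in a list and concatenating once keeps it linear.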

2 Answers


Your immediate issue is that you are not re-assigning inside the for-loop:

train = train.append(trainvals) 
test = test.append(testvals)

However, growing a large object like a DataFrame inside a loop is memory-inefficient, since each append copies all rows accumulated so far. Instead, iterate over `groupby` to build a list of dictionaries containing the test and train splits via a list comprehension, then call `pd.concat` once to bind each set together. A small helper function keeps the processing organized.

def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]

    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(g) for _, g in df.groupby('itemid')]

train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
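
A version note on the first snippet: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas the same re-assignment fix uses `pd.concat` instead (a sketch with made-up frames):

```python
import pandas as pd

train = pd.DataFrame({'a': [0]})          # hypothetical accumulator
trainvals = pd.DataFrame({'a': [1, 2]})   # hypothetical new chunk

# pd.concat, like DataFrame.append did, returns a NEW frame,
# so the result must be assigned back:
train = pd.concat([train, trainvals], ignore_index=True)
```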
Parfait
  • This worked just as I wanted. It's interesting that I need to say train = train.append(), I guess the append method creates a copy of the dataframe instead of using a reference to it in memory? – Sathvik Chinta Sep 02 '20 at 17:28
  • You might have confused Pandas' [`DataFrame.append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) with Python's built-in `list.append`. Usually for most Pandas operations if you do not use an available `inplace=True` argument you need to assign back the return of method. Yes, copies are made but in your case not re-assigned to overwrite with each loop pass. – Parfait Sep 02 '20 at 17:35

You can avoid the loop with `df.groupby` plus `quantile`.

train = df.groupby('itemid').quantile(0.8)
test = df.loc[~df.index.isin(train.index), :] # all rows not in train

Note this could have unexpected behavior if df.index is not unique.
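
An alternative vectorized per-item split that returns the original rows (rather than aggregated quantiles) can be sketched with `groupby.cumcount`; the data below is made up for illustration:

```python
import pandas as pd

# Toy data standing in for the real CSV (hypothetical)
df = pd.DataFrame({
    'itemid': ['a'] * 5 + ['b'] * 5,
    'Date': pd.to_datetime(['2020-01-0%d' % d for d in range(1, 6)] * 2),
    'val': range(10),
})

df = df.sort_values(['itemid', 'Date'])
pos = df.groupby('itemid').cumcount()            # 0-based position within each item
cutoff = (df.groupby('itemid')['itemid']
            .transform('size') * 0.8).astype(int)  # floor(0.8 * n), as in the question
train = df[pos < cutoff]
test = df[pos >= cutoff]
```

Here `cumcount` gives each row its chronological rank within its item, so `pos < cutoff` selects the first 80% of each item's rows in one vectorized comparison.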

RichieV