
I currently have a CSV with many rows (around 200k) and many columns. I want a time-series train/test split: the dataset contains many unique items, and for each item the first 80% of its rows (chronologically) should go into the training data. I wrote the following code to do so:

import pandas as pd 
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
    df2 = df.loc[df['itemid'] == i]
    df2 = df2.sort_values(by='Date',ascending=True)
    trainvals = df2[:int(len(df2)*0.8)]
    testvals = df2[int(len(df2)*0.8):]
    train.append(trainvals)
    test.append(testvals)

It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I adding them in wrong?

    [Never call `DataFrame.append` or `pd.concat` inside a for-loop. It leads to quadratic copying.](https://stackoverflow.com/a/36489724/1422451) – Parfait Sep 02 '20 at 02:19
  • @Parfait Ok so to avoid that I think I can have trainvals and testvals declared before the for loop, then just append them to train and test after the loop? – Sathvik Chinta Sep 02 '20 at 02:21
  • Are the time series concurrent for each item? i.e. all items have values from 9am to 5:30pm.. if so, simply sorting on time could allow for a single split – RichieV Sep 02 '20 at 02:48
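
The pattern Parfait's comment recommends — accumulate the pieces in a plain Python list, then concatenate once at the end — can be sketched like this (the frame below is made-up toy data, not the asker's CSV):

```python
import pandas as pd

# Toy frame standing in for the real CSV (hypothetical data)
df = pd.DataFrame({'itemid': [1, 1, 2, 2], 'val': [10, 20, 30, 40]})

pieces = []                     # plain Python list
for _, grp in df.groupby('itemid'):
    pieces.append(grp)          # list.append is cheap; no frame copying
result = pd.concat(pieces)      # one concatenation at the end
```

Each `DataFrame.append` / `pd.concat` call copies every row accumulated so far, so calling it once per iteration is O(n²) in total; collecting in a list and concatenating once keeps it linear.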

2 Answers


Your immediate issue is that you are not re-assigning inside the for-loop:

train = train.append(trainvals) 
test = test.append(testvals)

However, growing a large object like a DataFrame inside a loop is memory-inefficient, since each append copies all rows accumulated so far. Instead, iterate over `groupby` to build a list of dictionaries containing the test and train splits via a list comprehension, then call `pd.concat` once to bind each set together. A small helper function keeps the processing organized.

def split_dfs(df):
    df = df.sort_values(by='Date')
    trainvals = df[:int(len(df)*0.8)]
    testvals = df[int(len(df)*0.8):]

    return {'train': trainvals, 'test': testvals}

dfs = [split_dfs(g) for _, g in df.groupby('itemid')]

train_df = pd.concat([x['train'] for x in dfs])
test_df = pd.concat([x['test'] for x in dfs])
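
A version note on the first snippet: `DataFrame.append` was deprecated in pandas 1.4 and removed in pandas 2.0. On current pandas the same re-assignment fix uses `pd.concat` instead (a sketch with made-up frames):

```python
import pandas as pd

train = pd.DataFrame({'a': [0]})          # hypothetical accumulator
trainvals = pd.DataFrame({'a': [1, 2]})   # hypothetical new chunk

# pd.concat, like DataFrame.append did, returns a NEW frame,
# so the result must be assigned back:
train = pd.concat([train, trainvals], ignore_index=True)
```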
Parfait
  • This worked just as I wanted. It's interesting that I need to say train = train.append(), I guess the append method creates a copy of the dataframe instead of using a reference to it in memory? – Sathvik Chinta Sep 02 '20 at 17:28
  • You might have confused Pandas' [`DataFrame.append`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.DataFrame.append.html) with Python's built-in `list.append`. Usually for most Pandas operations if you do not use an available `inplace=True` argument you need to assign back the return of method. Yes, copies are made but in your case not re-assigned to overwrite with each loop pass. – Parfait Sep 02 '20 at 17:35

You can avoid the loop with `df.groupby` plus `quantile`.

train = df.groupby('itemid').quantile(0.8)
test = df.loc[~df.index.isin(train.index), :] # all rows not in train

Note this could have unexpected behavior if df.index is not unique.
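
An alternative vectorized per-item split that returns the original rows (rather than aggregated quantiles) can be sketched with `groupby.cumcount`; the data below is made up for illustration:

```python
import pandas as pd

# Toy data standing in for the real CSV (hypothetical)
df = pd.DataFrame({
    'itemid': ['a'] * 5 + ['b'] * 5,
    'Date': pd.to_datetime(['2020-01-0%d' % d for d in range(1, 6)] * 2),
    'val': range(10),
})

df = df.sort_values(['itemid', 'Date'])
pos = df.groupby('itemid').cumcount()            # 0-based position within each item
cutoff = (df.groupby('itemid')['itemid']
            .transform('size') * 0.8).astype(int)  # floor(0.8 * n), as in the question
train = df[pos < cutoff]
test = df[pos >= cutoff]
```

Here `cumcount` gives each row its chronological rank within its item, so `pos < cutoff` selects the first 80% of each item's rows in one vectorized comparison.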

RichieV