I currently have a CSV that contains many rows (some 200k) with many columns on each. I basically want to have a time series training and test data split. I have many unique items inside of my dataset, and I want the first 80% (chronologically) of each to be in the training data. I wrote the following code to do so
import pandas as pd
df = pd.read_csv('Data.csv')
df['Date'] = pd.to_datetime(df['Date'])
test = pd.DataFrame()
train = pd.DataFrame()
itemids = df.itemid.unique()
for i in itemids:
df2 = df.loc[df['itemid'] == i]
df2 = df2.sort_values(by='Date',ascending=True)
trainvals = df2[:int(len(df2)*0.8)]
testvals = df2[int(len(df2)*0.8):]
train.append(trainvals)
test.append(testvals)
It seems like trainvals and testvals are being populated properly, but they are not being added into test and train. Am I adding them in wrong?