10

I have a dataset with 1000 rows like this:

    Date,       Cost, Quantity (in ton), Source, Unloading Station
    01/10/2015, 7,    5.416,             XYZ,    ABC

I want to split the data on the basis of date. For example, rows up to 20.12.2016 are training data, and rows after that date are test data.

How should I split it? Is it possible?

kush
  • simple loop would suffice – lejlot May 30 '16 at 18:55
  • Yes, it is possible to split data this way. Whether it's the right thing to do is contextual; your intention already seems to be to split it this way. I'm unclear on the type of answer you are expecting. Can you clarify the question? – roganjosh May 30 '16 at 18:57
  • @roganjosh There is a dataset with the above labels (date, cost, quantity, source, destination). Given a specific date (e.g. 1/10/2016), I want the data up to that date as the training dataset and the rest as the test dataset. On any particular date, many quantities have been sent from source to destination. The dates are sequential, e.g. from 1/1/2015 to 1/1/2016. – kush May 30 '16 at 19:12
  • what is the type of your dataset? is it a pandas data frame? – MaxU - stand with Ukraine May 30 '16 at 19:57
  • @kush it's still not a question though, it's just a statement. How is your data read into Python? "How should I split?" is impossible to answer. "is it possible?" - almost certainly "yes". You need to clarify in the question what you are looking to do and, preferably, post what you have tried that doesn't work. – roganjosh May 30 '16 at 20:56
  • I was using a pandas data frame, and it was easy to split it into different sets on the basis of date. – kush Dec 17 '16 at 10:40

3 Answers

18

You can easily do that by converting the Date column to datetime with pandas' to_datetime and setting it as the index.

import pandas as pd
df['Date'] = pd.to_datetime(df['Date'])
df = df.set_index(df['Date'])
df = df.sort_index()

Once your data is in this format, you can slice the index by date to create the partitions:

# create train test partition
train = df['2015-01-10':'2016-12-20']
test  = df['2016-12-21':]
print('Train Dataset:',train.shape)
print('Test Dataset:',test.shape)
Sayali Sonawane
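
For a version of the index-slicing approach that can be run as-is, here is a minimal sketch. The sample rows are hypothetical, and the explicit `format="%d/%m/%Y"` is an assumption that the dates in the question are day-first (so `01/10/2015` is 1 October 2015); adjust it if your file is month-first.

```python
import pandas as pd

# Hypothetical sample rows in the question's layout; dates assumed day-first (DD/MM/YYYY).
df = pd.DataFrame({
    "Date": ["01/10/2015", "15/06/2016", "20/12/2016", "21/12/2016", "05/01/2017"],
    "Cost": [7, 9, 6, 8, 10],
    "Quantity": [5.416, 4.2, 6.1, 3.9, 5.0],
})

# Parse with an explicit format so 01/10/2015 is read as 1 October, not 10 January.
df["Date"] = pd.to_datetime(df["Date"], format="%d/%m/%Y")
df = df.set_index("Date").sort_index()

# Label-based slicing on a sorted DatetimeIndex includes both endpoints.
train = df.loc[:"2016-12-20"]
test = df.loc["2016-12-21":]

print("Train Dataset:", train.shape)
print("Test Dataset:", test.shape)
```

Leaving the start of the training slice open avoids having to know the earliest date in the data.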
12

Assuming that your dataset is a pandas DataFrame and that the Date column is of datetime dtype:

import pandas as pd

# pd.datetime was removed in pandas 2.0; pd.Timestamp works across versions
split_date = pd.Timestamp(2016, 12, 20)

df_training = df.loc[df['Date'] <= split_date]
df_test = df.loc[df['Date'] > split_date]
MaxU - stand with Ukraine
1

If your dates are in the standard Python datetime format, i.e. '2016-06-23 23:00:00', you can use the code below:

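split_date = '2016-06-23 23:00:00'
# Select the validation rows first: overwriting train_data first would leave no rows after split_date
validation_data = train_data.loc[train_data['Date'] > split_date]
train_data = train_data.loc[train_data['Date'] <= split_date]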