How can I split a dataset into train and test sets by date in Python, so that the first 90% of the rows (from 2018-01-01 until 2019-02-01) become the training data and the last 10% (from 2019-02-02 onward) become the test data? I do not want a random split.

  • I believe this `df_train = df[df['date'] <= '2019-02-01']` and this `df_test = df[df['date'] >= '2019-02-02']` should do the trick – Louis Jun 02 '20 at 09:50
  • @Louis this code, `from sklearn.model_selection import train_test_split` followed by `train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=42)`, splits the data randomly. I want something similar, but splitting the data by date. – Themba Mahlasela Jun 02 '20 at 10:10
  • If you sort your dataframe by date and then call `sklearn.model_selection.train_test_split` with the parameter `shuffle` set to `False`, you should get the result you want. – Louis Jun 02 '20 at 11:48
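
The date-cutoff idea from the comments can be sketched as follows. This is only an illustration: the sample dataframe, the `value` column, and the `cutoff` variable are assumptions made up for the example, not part of the original question.

```python
import pandas as pd

# Hypothetical sample data: one row per day from 2018-01-01 to 2019-03-31.
df = pd.DataFrame({"date": pd.date_range("2018-01-01", "2019-03-31", freq="D")})
df["value"] = range(len(df))

# Make sure the column has a datetime dtype rather than plain strings,
# so that comparisons are chronological.
df["date"] = pd.to_datetime(df["date"])

# Last day that should still belong to the training set (assumed cutoff).
cutoff = pd.Timestamp("2019-02-01")

# Inclusive on the training side, strictly later on the test side,
# so every row ends up in exactly one of the two sets.
df_train = df[df["date"] <= cutoff]
df_test = df[df["date"] > cutoff]
```

Using `<=` on one side and `>` on the other avoids silently dropping the rows that fall exactly on the boundary dates.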

2 Answers

As explained in this SO post, you can sort the dataframe by date and then split it with `np.split`:

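import numpy as np

# Sort chronologically so the earliest rows land in the training set.
df = df.sort_values('date')
data = df.values  # NumPy array; column names are dropped
# Split at the 90% mark: first 90% of rows for training, the rest for testing.
train_set, test_set = np.split(data, [int(.9 * len(data))])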
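
If you would rather keep the column names and dtypes, a position-based split with `iloc` is one possible variant of the same idea. This is a minimal sketch on made-up data; the `value` column and the variable names are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical sample data: 100 daily rows, deliberately shuffled
# to simulate a dataframe that is not yet in chronological order.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=100, freq="D"),
    "value": range(100),
})
df = df.sample(frac=1, random_state=0)

# Sort by date and reset the index so positions match chronology.
df = df.sort_values("date").reset_index(drop=True)

# Row position that marks the end of the first 90% of the data.
split_at = int(0.9 * len(df))

train_df = df.iloc[:split_at]  # earliest 90% of rows
test_df = df.iloc[split_at:]   # latest 10% of rows
```

Both results are still dataframes, so later steps can keep selecting columns by name.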
above_c_level

If your data is already sorted by date in a pandas dataframe, simply pass `shuffle=False`, so that the last rows become the test set:

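from sklearn.model_selection import train_test_split

# Optional: separate the target column before splitting.
# target_attribute = df['column_name']
# df = df.drop(columns=['column_name'])

# With shuffle=False the rows keep their order, so the last 20% become the test set
# (use test_size=0.1 for the 90/10 split asked about in the question).
# random_state is omitted because it has no effect without shuffling,
# and stratify must stay None when shuffle=False.
trainingSet, testSet = train_test_split(df,
                                        # target_attribute,
                                        test_size=0.2,
                                        shuffle=False)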
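
For the features/labels form mentioned in the question's comments, the same call can return four pieces at once. The following is a sketch under assumptions: the sample data and the column names `feature` and `target` are hypothetical, and `test_size=0.1` is chosen to match the 90/10 split in the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample data: 200 daily rows with one feature and one target.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=200, freq="D"),
    "feature": range(200),
    "target": range(200, 400),
})

# Sort chronologically first; shuffle=False only preserves the existing order.
df = df.sort_values("date")

features = df.drop(columns=["target"])
labels = df["target"]

# Ordered split: the first 90% of rows go to train, the last 10% to test.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.1, shuffle=False
)
```

Because both inputs are split at the same row position, each label stays aligned with its feature row.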
Mario