How can I split a dataset into train and test sets by date rather than randomly in Python? For example, the first 90% of rows (from 2018-01-01 through 2019-02-01) would be the training data and the last 10% (from 2019-02-02 onward) would be the test data.
-
I believe `df_train = df[df['date'] <= '2019-02-01']` and `df_test = df[df['date'] >= '2019-02-02']` should do the trick. – Louis Jun 02 '20 at 09:50
-
@Louis This code, `from sklearn.model_selection import train_test_split` followed by `train_features, test_features, train_labels, test_labels = train_test_split(features, labels, test_size=0.25, random_state=42)`, splits the data. I want something similar, but splitting the data by date. – Themba Mahlasela Jun 02 '20 at 10:10
-
If you sort your dataframe by date and then call `sklearn.model_selection.train_test_split` with the parameter `shuffle` set to `False`, you should get the result you want. – Louis Jun 02 '20 at 11:48
2 Answers
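As explained in this SO post, you can split it with `np.split`:

    import numpy as np

    # Sort chronologically so the split point separates earlier rows from later ones.
    df = df.sort_values('date')
    data = df.values  # NumPy array; column names are dropped
    # Split at the 90% mark: the first 90% of rows are train, the rest are test.
    train_set, test_set = np.split(data, [int(.9 * len(data))])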
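If the cut should fall on a fixed calendar date rather than on a row fraction (the question mentions training through 2019-02-01 and testing from 2019-02-02), a boolean mask on the date column is one way to do it. This is only a minimal sketch: the column name `date` comes from the question, while the sample frame and the `value` column are made up for illustration.

```python
import pandas as pd

# Hypothetical sample data: one row per day from 2018-01-01 to 2019-03-15.
df = pd.DataFrame({"date": pd.date_range("2018-01-01", "2019-03-15", freq="D")})
df["value"] = range(len(df))

# Make sure the column holds real datetimes, not strings, before comparing.
df["date"] = pd.to_datetime(df["date"])
df = df.sort_values("date")

cutoff = pd.Timestamp("2019-02-01")  # last day of the training period

# Inclusive on the train side, strictly later on the test side,
# so no row is dropped and none lands in both sets.
df_train = df[df["date"] <= cutoff]
df_test = df[df["date"] > cutoff]

print(len(df_train), len(df_test))
```

Unlike `df.values`, this keeps both halves as DataFrames with their column names. Note that a date cutoff only yields a 90/10 split if the rows happen to be distributed that way over time.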

above_c_level
If your data is already sorted by date within the pandas DataFrame, simply pass `shuffle=False`:

    from sklearn.model_selection import train_test_split

    # Optional: separate the target column first and pass it as a second array.
    # target_attribute = df['column_name']
    # df = df.drop(columns=['column_name'])

    # shuffle=False keeps the row order, so the last 10% of rows become the test set.
    # random_state has no effect here, and stratify must stay None when shuffle=False.
    trainingSet, testSet = train_test_split(
        df,
        # target_attribute,
        test_size=0.1,
        shuffle=False,
    )
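For the four-way output the asker mentioned in the comments (`train_features`, `test_features`, `train_labels`, `test_labels`), the same idea should carry over: sort by date, then pass both arrays with `shuffle=False`. A rough sketch under assumptions; the `feature` and `label` columns and the sample data are hypothetical.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical sample data: 100 daily rows, deliberately put out of order.
df = pd.DataFrame({
    "date": pd.date_range("2018-01-01", periods=100, freq="D"),
    "feature": range(100),
    "label": [i % 2 for i in range(100)],
}).sample(frac=1, random_state=0)

# Sort chronologically first; shuffle=False only preserves the existing order.
df = df.sort_values("date").reset_index(drop=True)

features = df[["date", "feature"]]
labels = df["label"]

# The first 90% of rows go to train, the last 10% to test.
train_features, test_features, train_labels, test_labels = train_test_split(
    features, labels, test_size=0.1, shuffle=False
)

# Every training date precedes every test date.
print(train_features["date"].max(), test_features["date"].min())
```

The sort step matters: `shuffle=False` does not look at the dates at all, it just cuts the rows where they are.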

Mario