I need to split my dataset into two splits: 80% and 20%.
My dataset looks like this:
PersonID Timestamp Foo1 Foo2 Foo3 Label
1 1626184812 6 5 2 1
2 616243602 8 5 2 1
2 634551342 4 8 3 1
2 1531905378 3 8 8 1
3 616243602 10 7 8 2
3 634551342 7 5 8 2
4 1626184812 7 9 1 2
4 616243602 5 7 9 1
4 634551342 9 1 6 2
4 1531905378 3 3 3 1
4 768303369 6 1 7 2
5 1626184812 5 7 8 2
5 616243602 6 2 6 1
6 1280851467 3 2 2 2
7 1626184812 10 1 10 1
7 616243602 6 3 6 2
7 1531905378 9 5 7 2
7 634551342 3 7 9 1
8 616243602 8 7 4 2
8 634551342 2 2 4 1
(Note: you should be able to use pd.read_clipboard() to get this data into a dataframe.)
What I am trying to accomplish is:
- Split this dataset into an 80/20 split (training, testing)
- The dataset should be mostly organized by Timestamp, meaning the older data should be in training and the newer data should be in testing
- Additionally, a single person should not be split between training and testing: all of the samples for a given PersonID must be in one set or the other
The first two points are accomplished in the minimal example below, which uses sklearn's train_test_split. The third point is what I am having trouble with.
Minimal example:
# Imports
import pandas as pd
from sklearn.model_selection import train_test_split
# Normal split
x = pd.read_clipboard()
train, test = train_test_split(x, train_size=0.8, test_size=0.20, random_state=8)
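# Organizing it by time: sort, then split without shuffling, so the
# oldest 80% of rows go to train and the newest 20% go to test
# (train_test_split shuffles by default, which would undo the sort)
x = pd.read_clipboard()
x = x.sort_values(by='Timestamp')
train, test = train_test_split(x, train_size=0.8, test_size=0.20, shuffle=False)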
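To see whether a split puts the same person in both sets, one option is to intersect the PersonID values of the two frames. This is a minimal sketch on made-up toy data, not the dataset above; the helper name shared_person_ids is hypothetical, and the positional iloc cut stands in for an unshuffled 80/20 split.

```python
import pandas as pd


def shared_person_ids(train, test, id_col="PersonID"):
    """Return the sorted IDs that occur in both frames (hypothetical helper)."""
    return sorted(set(train[id_col].tolist()) & set(test[id_col].tolist()))


# Made-up toy data: person 2 has one old row and one new row
df = pd.DataFrame({
    "PersonID":  [1, 2, 2, 3, 3, 4],
    "Timestamp": [50, 10, 60, 20, 30, 40],
})

# Oldest rows first, then cut at 80% of the rows (int(4.8) == 4)
df = df.sort_values("Timestamp")
cut = int(len(df) * 0.8)
train, test = df.iloc[:cut], df.iloc[cut:]

overlap = shared_person_ids(train, test)
print(overlap)  # [2]: person 2 has rows on both sides of the cut
```

An empty list would mean that no person is split between the two sets.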
I am struggling to figure out how to group the dataframe so that one person is not split across train and test. For example, in the code above, each PersonID in the test dataframe also appears in the train dataframe. How can I keep the proportions close to 80/20 while ensuring that no PersonID is split?
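One possible approach, sketched under stated assumptions rather than offered as a definitive answer: rank each person by their most recent Timestamp, then assign whole persons to training, oldest first, until about 80% of the rows are covered. The function name grouped_time_split and the choice of the latest timestamp as the ordering key are assumptions made for illustration. Because people's time ranges overlap in this data, the result is only "mostly" ordered by time.

```python
import pandas as pd


def grouped_time_split(df, group_col="PersonID", time_col="Timestamp",
                       train_frac=0.8):
    """Split df so that every group lands entirely in train or in test.

    Groups are ordered by their latest timestamp (an assumption; the
    earliest or median timestamp would also be defensible choices).
    """
    # One row per person: latest timestamp and number of samples
    per_group = (
        df.groupby(group_col)[time_col]
          .agg(latest="max", n_rows="size")
          .sort_values("latest", kind="stable")
    )
    # Share of all rows covered as persons are added, oldest first
    cum_share = per_group["n_rows"].cumsum() / len(df)
    # Keep persons while the running share stays within train_frac.
    # Note: train is empty if the first person alone exceeds train_frac.
    train_ids = cum_share.index[cum_share <= train_frac]
    in_train = df[group_col].isin(train_ids)
    return df[in_train], df[~in_train]


# The sample data from the question
rows = [
    (1, 1626184812, 6, 5, 2, 1), (2, 616243602, 8, 5, 2, 1),
    (2, 634551342, 4, 8, 3, 1), (2, 1531905378, 3, 8, 8, 1),
    (3, 616243602, 10, 7, 8, 2), (3, 634551342, 7, 5, 8, 2),
    (4, 1626184812, 7, 9, 1, 2), (4, 616243602, 5, 7, 9, 1),
    (4, 634551342, 9, 1, 6, 2), (4, 1531905378, 3, 3, 3, 1),
    (4, 768303369, 6, 1, 7, 2), (5, 1626184812, 5, 7, 8, 2),
    (5, 616243602, 6, 2, 6, 1), (6, 1280851467, 3, 2, 2, 2),
    (7, 1626184812, 10, 1, 10, 1), (7, 616243602, 6, 3, 6, 2),
    (7, 1531905378, 9, 5, 7, 2), (7, 634551342, 3, 7, 9, 1),
    (8, 616243602, 8, 7, 4, 2), (8, 634551342, 2, 2, 4, 1),
]
x = pd.DataFrame(
    rows, columns=["PersonID", "Timestamp", "Foo1", "Foo2", "Foo3", "Label"]
)

train_df, test_df = grouped_time_split(x)

# No person should appear on both sides
assert set(train_df["PersonID"]).isdisjoint(set(test_df["PersonID"]))
print(len(train_df), len(test_df))
```

On the sample data this happens to land exactly on 16 training rows and 4 test rows; in general the proportion is only approximate, since whole persons are moved at once. If time order did not matter, sklearn's GroupShuffleSplit would also keep each PersonID in a single set, but it assigns groups at random.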