
I need to split my dataset into two splits: 80% and 20%.

My dataset looks like this:

PersonID    Timestamp   Foo1    Foo2    Foo3    Label
1   1626184812  6   5   2   1
2   616243602   8   5   2   1
2   634551342   4   8   3   1
2   1531905378  3   8   8   1
3   616243602   10  7   8   2
3   634551342   7   5   8   2
4   1626184812  7   9   1   2
4   616243602   5   7   9   1
4   634551342   9   1   6   2
4   1531905378  3   3   3   1
4   768303369   6   1   7   2
5   1626184812  5   7   8   2
5   616243602   6   2   6   1
6   1280851467  3   2   2   2
7   1626184812  10  1   10  1
7   616243602   6   3   6   2
7   1531905378  9   5   7   2
7   634551342   3   7   9   1
8   616243602   8   7   4   2
8   634551342   2   2   4   1

(Note, you should be able to use pd.read_clipboard() to get this data into a dataframe.)

What I am trying to accomplish is:

  • Split this dataset into an 80/20 split (training, testing)
  • The dataset should be mostly organized by Timestamp: older data should go to training, newer data to testing
  • Additionally, a single Person should not be split between training and testing. For example, all of the samples for a given PersonID must be in one set or the other!

The first two points are accomplished in the minimal example below using sklearn's train_test_split; the third point is what I am having trouble with.

# Imports
import pandas as pd
from sklearn.model_selection import train_test_split

# Normal split
x = pd.read_clipboard()
train, test = train_test_split(x, train_size=0.8, test_size=0.20, random_state=8)

# Organizing it by time: sort first, then split without shuffling
# (a shuffled split would destroy the time ordering)
x = pd.read_clipboard()
x = x.sort_values(by='Timestamp')
train, test = train_test_split(x, train_size=0.8, test_size=0.2, shuffle=False)

I am struggling to figure out how to group the dataframe so that one person is not split across train and test. In the example above, each PersonID in the test dataframe also appears in the train dataframe. How can I keep the proportions roughly 80/20 while ensuring that no PersonID is split?

artemis

1 Answer


The two conditions are hard to combine rigorously: older timestamps should go to train, while each PersonID must end up entirely in either train or test. Here are two ideas.

For keeping older timestamps in train (probably not the most rigorous solution), take the max (or the min, or the mean, up to you) Timestamp per PersonID together with the row count, sort by that max, then take the cumulative sum of the counts. Persons whose cumulative count stays below 80% of the total number of rows go to train.

# Per person: latest Timestamp and row count; sort by latest
# Timestamp, then cumulative-sum the counts
s = (
    x.groupby('PersonID')['Timestamp']
     .agg(['max', 'count'])
     .sort_values('max')
     ['count'].cumsum()
)
s = s < s.max() * 0.8  # boolean mask: True for train PersonIDs

and then you get

train = x.loc[x['PersonID'].isin(s[s].index)]
print(train)
#     PersonID   Timestamp  Foo1  Foo2  Foo3  Label
# 0          1  1626184812     6     5     2      1
# 1          2   616243602     8     5     2      1
# 2          2   634551342     4     8     3      1
# 3          2  1531905378     3     8     8      1
# 4          3   616243602    10     7     8      2
# 5          3   634551342     7     5     8      2
# 6          4  1626184812     7     9     1      2
# 7          4   616243602     5     7     9      1
# 8          4   634551342     9     1     6      2
# 9          4  1531905378     3     3     3      1
# 10         4   768303369     6     1     7      2
# 13         6  1280851467     3     2     2      2
# 18         8   616243602     8     7     4      2
# 19         8   634551342     2     2     4      1


test = x.loc[x['PersonID'].isin(s[~s].index)]
print(test)
#     PersonID   Timestamp  Foo1  Foo2  Foo3  Label
# 11         5  1626184812     5     7     8      2
# 12         5   616243602     6     2     6      1
# 14         7  1626184812    10     1    10      1
# 15         7   616243602     6     3     6      2
# 16         7  1531905378     9     5     7      2
# 17         7   634551342     3     7     9      1
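
As a quick check (a sketch rebuilding the sample data with io.StringIO as a stand-in for pd.read_clipboard()), you can see what proportion this approach actually achieves; because whole persons move together, the 80% target is only approximate:

```python
import io
import pandas as pd

# Rebuild the question's sample data (stand-in for pd.read_clipboard())
data = """PersonID Timestamp Foo1 Foo2 Foo3 Label
1 1626184812 6 5 2 1
2 616243602 8 5 2 1
2 634551342 4 8 3 1
2 1531905378 3 8 8 1
3 616243602 10 7 8 2
3 634551342 7 5 8 2
4 1626184812 7 9 1 2
4 616243602 5 7 9 1
4 634551342 9 1 6 2
4 1531905378 3 3 3 1
4 768303369 6 1 7 2
5 1626184812 5 7 8 2
5 616243602 6 2 6 1
6 1280851467 3 2 2 2
7 1626184812 10 1 10 1
7 616243602 6 3 6 2
7 1531905378 9 5 7 2
7 634551342 3 7 9 1
8 616243602 8 7 4 2
8 634551342 2 2 4 1"""
x = pd.read_csv(io.StringIO(data), sep=r'\s+')

# Cumsum-of-counts split from above
s = (x.groupby('PersonID')['Timestamp'].agg(['max', 'count'])
       .sort_values('max')['count'].cumsum())
mask = s < s.max() * 0.8
train = x[x['PersonID'].isin(mask[mask].index)]
print(len(train) / len(x))  # 0.7 on this data: close to, not exactly, 0.8
```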

The other option is to use sklearn.model_selection.GroupShuffleSplit:

from sklearn.model_selection import GroupShuffleSplit

gss = GroupShuffleSplit(n_splits=1, train_size=0.8, random_state=42)

for train_idx, test_idx in gss.split(x, groups=x['PersonID']):
    # the yielded indices are positional, so use iloc
    print("TRAIN ID:", x.iloc[train_idx]['PersonID'].unique(),
          ", Timestamp:", x.iloc[train_idx]['Timestamp'].mean())
    print("TEST ID:", x.iloc[test_idx]['PersonID'].unique(),
          ", Timestamp:", x.iloc[test_idx]['Timestamp'].mean())
# TRAIN ID: [1 3 4 5 7 8] , Timestamp: 997267296.9375
# TEST ID: [2 6] , Timestamp: 1015887947.25

and increase the n_splits parameter, keeping the split that gives the smallest mean Timestamp for train (or the highest for test).
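
A sketch of that selection (rebuilding the sample data inline so the snippet is self-contained; the n_splits value of 25 is an arbitrary choice): generate several candidate group splits and keep the one whose train set has the oldest mean Timestamp.

```python
import io
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit

# Rebuild the question's sample data (stand-in for pd.read_clipboard())
data = """PersonID Timestamp Foo1 Foo2 Foo3 Label
1 1626184812 6 5 2 1
2 616243602 8 5 2 1
2 634551342 4 8 3 1
2 1531905378 3 8 8 1
3 616243602 10 7 8 2
3 634551342 7 5 8 2
4 1626184812 7 9 1 2
4 616243602 5 7 9 1
4 634551342 9 1 6 2
4 1531905378 3 3 3 1
4 768303369 6 1 7 2
5 1626184812 5 7 8 2
5 616243602 6 2 6 1
6 1280851467 3 2 2 2
7 1626184812 10 1 10 1
7 616243602 6 3 6 2
7 1531905378 9 5 7 2
7 634551342 3 7 9 1
8 616243602 8 7 4 2
8 634551342 2 2 4 1"""
x = pd.read_csv(io.StringIO(data), sep=r'\s+')

gss = GroupShuffleSplit(n_splits=25, train_size=0.8, random_state=42)

# Keep the candidate split whose train set has the smallest mean Timestamp
train_idx, test_idx = min(
    gss.split(x, groups=x['PersonID']),
    key=lambda idx: x.iloc[idx[0]]['Timestamp'].mean(),
)
train, test = x.iloc[train_idx], x.iloc[test_idx]

# No PersonID is shared between the two sets
assert set(train['PersonID']).isdisjoint(test['PersonID'])
```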

Ben.T