I have a data set of subjects, and each subject has a number of rows in my pandas dataframe (each measurement is a row, and a subject may be measured several times). I would like to split my data into training and test sets, but I cannot split the rows randomly because all of a subject's measurements are dependent (the same subject cannot appear in both train and test). How would you resolve this? Each subject has a different number of measurements.

Edit: My data includes the subject number for each row and I would like to split as close to 0.8/0.2 as possible.

AR_
  • [Can you provide your dataframe, and your expected output?](https://stackoverflow.com/questions/20109391/how-to-make-good-reproducible-pandas-examples) – Tony Aug 31 '17 at 11:48
  • Unfortunately I can't. I can give an example: data for 3 subjects, where subject 1 was measured 3 times, subject 2 was measured 4 times, and subject 3 was measured 3 times. That is a total of 10 rows, and I would like to split them as close as I can to, let's say, 0.8/0.2. So the training set would include 2 subjects with 7 measurements and the test set would include 1 subject with 3 measurements. – AR_ Aug 31 '17 at 11:51
  • How can you tell where one subject starts and stops? Are there columns, or are they multi-indexed? – Tony Aug 31 '17 at 12:02
  • As in my edit above, I have a column with the subject number, so you can tell for each row which subject it belongs to. – AR_ Aug 31 '17 at 12:04

1 Answer

Consider the dataframe df with column user_id to identify users.

import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.random.randint(5, size=(100, 4)), columns=['user_id'] + list('ABC')
)

You want to identify the unique users and randomly assign each of them to train or test. Then split the dataframe so that all rows of the test users go in one frame and all rows of the train users go in the other.

unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    np.random.permutation(unique_users), [int(.8 * len(unique_users))]
)

df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]

This splits the users 80/20. The rows are split roughly 80/20 only if the users have similar numbers of rows.
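A quick sanity check can make this concrete. The sketch below assumes a made-up dataframe with very uneven rows per user (the data, the seed, and the names `rng` and `rows_per_user` are illustrative and not part of the original answer). It confirms that no user lands in both sets and prints the row fraction that the user-level split actually produces:

```python
import numpy as np
import pandas as pd

# Hypothetical data: 10 users with very different numbers of rows each.
rng = np.random.default_rng(0)
rows_per_user = rng.integers(1, 30, size=10)
df = pd.DataFrame({
    'user_id': np.repeat(np.arange(10), rows_per_user),
    'A': rng.normal(size=rows_per_user.sum()),
})

# Split the users (not the rows) 80/20.
unique_users = df['user_id'].unique()
train_users, test_users = np.split(
    rng.permutation(unique_users), [int(.8 * len(unique_users))]
)
df_train = df[df['user_id'].isin(train_users)]
df_test = df[df['user_id'].isin(test_users)]

# Users present on both sides (should be none) and the share of rows in train.
overlap = set(df_train['user_id']) & set(df_test['user_id'])
train_row_fraction = len(df_train) / len(df)
print(len(overlap), round(train_row_fraction, 3))
```

The overlap is always empty, but the printed row fraction can sit well away from 0.8 when a few users own most of the rows.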


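However, if you want the row counts to be as close to 80/20 as possible, add users to the training set one at a time, in random order, and stop before the user that would push it past 80% of the rows.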

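unique_users = df['user_id'].unique()
# Number of rows the training set should hold at most
target_n = int(.8 * len(df))
shuffled_users = np.random.permutation(unique_users)

# Rows per user
user_count = df['user_id'].value_counts()

# True for users whose running row total (in shuffled order) stays within the target
mapping = user_count.reindex(shuffled_users).cumsum() <= target_n
# Broadcast the per-user flag to every row of that user
mask = df['user_id'].map(mapping)

df_train = df[mask]
df_test = df[~mask]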
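As a worked example, the same cumulative-count idea can be applied to the case described in the comments (3 subjects with 3, 4 and 3 measurements). This is only a sketch: the helper name `split_by_group`, its parameters, and the column names `subject` and `value` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def split_by_group(df, group_col, train_frac=0.8, seed=None):
    """Hypothetical helper: fill train with whole groups up to train_frac of the rows."""
    rng = np.random.default_rng(seed)
    shuffled = rng.permutation(df[group_col].unique())
    counts = df[group_col].value_counts()
    target_n = int(train_frac * len(df))
    # Groups whose running row total, in shuffled order, stays within the target
    in_train = counts.reindex(shuffled).cumsum() <= target_n
    mask = df[group_col].map(in_train)
    return df[mask], df[~mask]

# The example from the comments: 3 subjects with 3, 4 and 3 measurements.
df = pd.DataFrame({
    'subject': [1] * 3 + [2] * 4 + [3] * 3,
    'value': np.arange(10),
})
df_train, df_test = split_by_group(df, 'subject', seed=0)
print(df_train['subject'].nunique(), len(df_train), len(df_test))
```

With this data the training set always ends up with two subjects (6 or 7 rows) and the test set with one, whichever order the shuffle produces. Note that this rule never exceeds the target row count, so for a given shuffle it gives the closest split from below, which is not necessarily the closest split overall.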
piRSquared
  • Thank you for your answer. The problem is that not all subjects have the same number of rows. How would you include this into the 0.8/0.2 train/test goal? – AR_ Aug 31 '17 at 12:00