I've just seen this answer on SO which shows how to split data using numpy.
Assume we're going to split the data 0.8, 0.1, 0.1 for training, testing, and validation respectively; you do it this way:
train, test, val = np.split(df, [int(.8 * len(df)), int(.9 * len(df))])
I'd like to know how I could stratify while splitting the data with this approach.
Stratifying is splitting the data while keeping the priors of each class in the data. That is, if you're going to take 0.8 for the training set, you take 0.8 from each class you have. The same goes for the test and validation sets.
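To make the definition concrete, here is a small sketch of the arithmetic. The toy labels (100 rows of class "a", 50 of class "b") and the variable names are made up for illustration only; they are not from my real data.

```python
import pandas as pd

# Hypothetical toy labels: 100 rows of class "a", 50 rows of class "b".
labels = pd.Series(["a"] * 100 + ["b"] * 50, name="label")

fractions = {"train": 0.8, "test": 0.1, "val": 0.1}
counts = labels.value_counts()

# Number of rows each split should receive from each class when stratified.
expected = {
    name: (counts * frac).round().astype(int)
    for name, frac in fractions.items()
}

for name, per_class in expected.items():
    print(name, per_class.to_dict())
```

Every split keeps the same 2:1 ratio between the two classes as the full data set, which is what I mean by keeping the priors.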
I tried grouping the data first by class using:
grouped_df = df.groupby(class_col_name, group_keys=False)
But it did not show correct results.
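For reference, here is a minimal sketch of how the grouping might be combined with np.split, assuming the cut points are applied inside each class instead of on the whole frame. The function name stratified_split, its parameters, and the toy DataFrame are hypothetical. To stay independent of how np.split treats a DataFrame, the sketch splits shuffled row positions and then selects rows with iloc.

```python
import numpy as np
import pandas as pd


def stratified_split(df, class_col, train_frac=0.8, test_frac=0.1, seed=0):
    """Return (train, test, val), cutting each class at the same fractions."""
    rng = np.random.default_rng(seed)
    train_parts, test_parts, val_parts = [], [], []
    for _, group in df.groupby(class_col):
        n = len(group)
        # Shuffle row positions within the class, then cut them at the same
        # relative points the single np.split call uses on the whole frame.
        positions = rng.permutation(n)
        cuts = [int(train_frac * n), int((train_frac + test_frac) * n)]
        train_pos, test_pos, val_pos = np.split(positions, cuts)
        train_parts.append(group.iloc[train_pos])
        test_parts.append(group.iloc[test_pos])
        val_parts.append(group.iloc[val_pos])
    return pd.concat(train_parts), pd.concat(test_parts), pd.concat(val_parts)


# Hypothetical toy data: 100 rows of class "a", 50 rows of class "b".
df = pd.DataFrame({"x": range(150), "label": ["a"] * 100 + ["b"] * 50})
train, test, val = stratified_split(df, "label")
print(train["label"].value_counts().to_dict())
```

Because int() truncates, very small classes can end up with an empty test or validation part in this sketch, and the returned frames are ordered by class, so they may need one more shuffle before use.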
Note: I'm familiar with train_test_split from scikit-learn.