
There is already a description of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn), and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split?

The closest approximation that comes to mind for doing a stratified (on class label) train/validation/test split is as follows, but I suspect there is a better way, perhaps one that achieves this in a single function call or more accurately:

Let's say we want a 60/20/20 train/validation/test split. My current approach is to first do a stratified 60/40 split, then do a stratified 50/50 split on that 40% so as to ultimately get a 60/20/20 stratified split.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
SEED = 2000
# 60/40 stratified split, then split the 40% in half, so all three sets keep the class balance
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.4, stratify=y, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, stratify=y_validation_and_test, random_state=SEED)
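
To sanity-check that the splits really preserve the class balance, I print the class proportions of each subset and compare them to the full set (a rough sketch; class_proportions is just a small helper I wrote for this, assuming y and the split labels are array-like):

import numpy as np

def class_proportions(labels):
    """Return the fraction of samples in each class."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

# The four distributions below should be roughly identical
print(class_proportions(y))
print(class_proportions(y_train))
print(class_proportions(y_validation))
print(class_proportions(y_test))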

Please let me know whether my approach is correct and/or whether there is a better approach.

Thank you

blu
  • Same problem here, have you confirmed if this is the correct way to do it? – AritzBi Dec 05 '16 at 09:57
  • @AritzBi I haven't gotten confirmation from anyone, but it seems to work. However, I ultimately went with a different approach: I just do a stratified train/test split, and then for validation I rely on stratified k-fold cross-validation within the training set (a rough sketch of that follows these comments). Check out: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold – blu Dec 05 '16 at 13:55
  • Ok, thank you very much!!! – AritzBi Dec 05 '16 at 16:43
  • That's exactly what I do as well! It's too bad there isn't a built-in way to do this with sklearn. – Gyan Veda Dec 20 '16 at 20:36
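
For reference, a rough sketch of the StratifiedKFold approach mentioned in the comments above (it assumes x_train and y_train are NumPy arrays from a prior stratified train/test split, and the actual model fitting is left as a placeholder):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2000)
for fold_train_idx, fold_valid_idx in skf.split(x_train, y_train):
    # Each fold preserves the class proportions of y_train
    x_fold_train, x_fold_valid = x_train[fold_train_idx], x_train[fold_valid_idx]
    y_fold_train, y_fold_valid = y_train[fold_train_idx], y_train[fold_valid_idx]
    # fit and evaluate a model on this fold here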

2 Answers


The solution is simply to use StratifiedShuffleSplit twice, as shown below:

from sklearn.model_selection import StratifiedShuffleSplit

# First split: 60% of the rows go to train_set, 40% to test_valid_set, stratified on df.target
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

# Second split: divide the held-out 40% into equal-sized test and validation sets, again stratified
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
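
To confirm that the stratification worked, the class proportions of the three resulting sets can be compared against the full DataFrame (a quick check, assuming df.target holds the class labels as in the code above):

# All four distributions should be (approximately) the same
print(df.target.value_counts(normalize=True))
print(train_set.target.value_counts(normalize=True))
print(valid_set.target.value_counts(normalize=True))
print(test_set.target.value_counts(normalize=True))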
Anton Dergunov

Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.

In fact, if you end up testing your model using a scikit-learn estimator that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. The same goes if you use the (very handy!) model_selection.cross_val_score function.
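
For illustration, a minimal sketch of the cross_val_score route (the LogisticRegression estimator and the 5-fold setup are placeholder choices, not something prescribed here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified 5-fold cross-validation on the training portion only; for
# classifiers an integer cv already defaults to StratifiedKFold, but making
# it explicit keeps the stratification obvious.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2000)
scores = cross_val_score(LogisticRegression(max_iter=1000), x_train, y_train, cv=cv)
print(scores.mean(), scores.std())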

rocksteady
  • There is a problem splitting things twice. Eg. Imagine the worst case. There are 11 samples in the dataset. All 11 are of the same class. You want to split the such that train-set has 10 samples and test-set has 1 sample. The first one split will work as it will produce 10:1 split. But the second split can't happen because there is only one sample. Best sticking to k-fold cross validation. – Sandeep Thapa Aug 17 '22 at 09:19
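
A toy reproduction of the failure mode described in the comment above (purely illustrative; the numbers mirror the comment's example):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(11).reshape(-1, 1)
y = np.zeros(11)  # all 11 samples belong to the same class

# First split works: 10 samples for training, 1 held out
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=1, random_state=0)

# Second split fails: a single sample cannot be divided into non-empty
# validation and test halves, so this raises a ValueError
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)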