
There is already a description of how to do a stratified train/test split in scikit-learn via train_test_split (Stratified Train/Test-split in scikit-learn), and a description of how to do a random train/validation/test split via np.split (How to split data into 3 sets (train, validation and test)?). But what about doing a stratified train/validation/test split?

The closest approximation that comes to mind for doing a stratified (on class label) train/validation/test split is as follows, but I suspect there is a better way, perhaps one that achieves this in a single function call or more accurately:

Let's say we want a 60/20/20 train/validation/test split. My current approach is to first do a stratified 60/40 split, then do a stratified 50/50 split on that 40% so as to ultimately get a 60/20/20 stratified split.

from sklearn.model_selection import train_test_split  # sklearn.cross_validation is deprecated
SEED = 2000
# 60/40 stratified split, then split the 40% in half, so all three sets keep the class balance
x_train, x_validation_and_test, y_train, y_validation_and_test = train_test_split(x, y, test_size=.4, stratify=y, random_state=SEED)
x_validation, x_test, y_validation, y_test = train_test_split(x_validation_and_test, y_validation_and_test, test_size=.5, stratify=y_validation_and_test, random_state=SEED)
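
To sanity-check that the splits really preserve the class balance, I print the class proportions of each subset and compare them to the full set (a rough sketch; class_proportions is just a small helper I wrote for this, assuming y and the split labels are array-like):

import numpy as np

def class_proportions(labels):
    """Return the fraction of samples in each class."""
    values, counts = np.unique(labels, return_counts=True)
    return dict(zip(values, counts / counts.sum()))

# The four distributions below should be roughly identical
print(class_proportions(y))
print(class_proportions(y_train))
print(class_proportions(y_validation))
print(class_proportions(y_test))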

Please let me know whether my approach is correct and/or whether there is a better approach.

Thank you

blu
  • Same problem here, have you confirmed if this is the correct way to do it? – AritzBi Dec 05 '16 at 09:57
  • @AritzBi I haven't gotten confirmation from anyone, but it seems to work. However, I ultimately went with a different approach: I just do a stratified train/test split, and then for validation I rely on stratified k-fold cross-validation within the training set (a rough sketch of that follows these comments). Check out: http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedKFold.html#sklearn.model_selection.StratifiedKFold – blu Dec 05 '16 at 13:55
  • Ok, thank you very much!!! – AritzBi Dec 05 '16 at 16:43
  • That's exactly what I do as well! It's too bad there isn't a built-in way to do this with sklearn. – Gyan Veda Dec 20 '16 at 20:36
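
For reference, a rough sketch of the StratifiedKFold approach mentioned in the comments above (it assumes x_train and y_train are NumPy arrays from a prior stratified train/test split, and the actual model fitting is left as a placeholder):

from sklearn.model_selection import StratifiedKFold

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=2000)
for fold_train_idx, fold_valid_idx in skf.split(x_train, y_train):
    # Each fold preserves the class proportions of y_train
    x_fold_train, x_fold_valid = x_train[fold_train_idx], x_train[fold_valid_idx]
    y_fold_train, y_fold_valid = y_train[fold_train_idx], y_train[fold_valid_idx]
    # fit and evaluate a model on this fold here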

2 Answers


The solution is simply to use StratifiedShuffleSplit twice, as shown below:

from sklearn.model_selection import StratifiedShuffleSplit

# First split: 60% of the rows go to train_set, 40% to test_valid_set, stratified on df.target
split = StratifiedShuffleSplit(n_splits=1, test_size=0.4, random_state=42)
for train_index, test_valid_index in split.split(df, df.target):
    train_set = df.iloc[train_index]
    test_valid_set = df.iloc[test_valid_index]

# Second split: divide the held-out 40% into equal-sized test and validation sets, again stratified
split2 = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for test_index, valid_index in split2.split(test_valid_set, test_valid_set.target):
    test_set = test_valid_set.iloc[test_index]
    valid_set = test_valid_set.iloc[valid_index]
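
To confirm that the stratification worked, the class proportions of the three resulting sets can be compared against the full DataFrame (a quick check, assuming df.target holds the class labels as in the code above):

# All four distributions should be (approximately) the same
print(df.target.value_counts(normalize=True))
print(train_set.target.value_counts(normalize=True))
print(valid_set.target.value_counts(normalize=True))
print(test_set.target.value_counts(normalize=True))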
Anton Dergunov

Yes, this is exactly how I would do it - running train_test_split() twice. Think of the first as splitting off your training set, and then that training set may get divided into different folds or holdouts down the line.

In fact, if you end up testing your model using a scikit-learn estimator that includes built-in cross-validation, you may not even have to explicitly run train_test_split() again. The same goes if you use the (very handy!) model_selection.cross_val_score function.
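
For illustration, a minimal sketch of the cross_val_score route (the LogisticRegression estimator and the 5-fold setup are placeholder choices, not something prescribed here):

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Stratified 5-fold cross-validation on the training portion only; for
# classifiers an integer cv already defaults to StratifiedKFold, but making
# it explicit keeps the stratification obvious.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=2000)
scores = cross_val_score(LogisticRegression(max_iter=1000), x_train, y_train, cv=cv)
print(scores.mean(), scores.std())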

rocksteady
  • There is a problem splitting things twice. Eg. Imagine the worst case. There are 11 samples in the dataset. All 11 are of the same class. You want to split the such that train-set has 10 samples and test-set has 1 sample. The first one split will work as it will produce 10:1 split. But the second split can't happen because there is only one sample. Best sticking to k-fold cross validation. – Sandeep Thapa Aug 17 '22 at 09:19
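
A toy reproduction of the failure mode described in the comment above (purely illustrative; the numbers mirror the comment's example):

import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(11).reshape(-1, 1)
y = np.zeros(11)  # all 11 samples belong to the same class

# First split works: 10 samples for training, 1 held out
X_tr, X_rest, y_tr, y_rest = train_test_split(X, y, test_size=1, random_state=0)

# Second split fails: a single sample cannot be divided into non-empty
# validation and test halves, so this raises a ValueError
X_val, X_te, y_val, y_te = train_test_split(X_rest, y_rest, test_size=0.5, random_state=0)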