Divide dataset between train and test respecting class distribution

Question

I want to make 10 runs of a machine learning algorithm in a given dataset with the following distribution

np.unique(x[:,24], return_counts=True)
(array([1., 2.]), array([700, 300]))

It means that 70% of my data is from class 1, and 30% are from class 2.

A sample of my data is shown below. The last column holds the class label (1 or 2):

1,6,4,12,5,5,3,4,1,67,3,2,1,2,1,0,0,1,0,0,1,0,0,1,1
2,48,2,60,1,3,2,2,1,22,3,1,1,1,1,0,0,1,0,0,1,0,0,1,2
4,12,4,21,1,4,3,3,1,49,3,1,2,1,1,0,0,1,0,0,1,0,1,0,1
1,42,2,79,1,4,3,4,2,45,3,1,2,1,1,0,0,0,0,0,0,0,0,1,1
1,24,3,49,1,3,3,4,4,53,3,2,2,1,1,1,0,1,0,0,0,0,0,1,2
4,36,2,91,5,3,3,4,4,35,3,1,2,2,1,0,0,1,0,0,0,0,1,0,1
4,24,2,28,3,5,3,4,2,53,3,1,1,1,1,0,0,1,0,0,1,0,0,1,1
2,36,2,69,1,3,3,2,3,35,3,1,1,2,1,0,1,1,0,1,0,0,0,0,1
4,12,2,31,4,4,1,4,1,61,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
2,30,4,52,1,1,4,2,3,28,3,2,1,1,1,1,0,1,0,0,1,0,0,0,2
2,12,2,13,1,2,2,1,3,25,3,1,1,1,1,1,0,1,0,1,0,0,0,1,2
1,48,2,43,1,2,2,4,2,24,3,1,1,1,1,0,0,1,0,1,0,0,0,1,2
2,12,2,16,1,3,2,1,3,22,3,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,4,12,1,5,3,4,3,60,3,2,1,1,1,1,0,1,0,0,1,0,1,0,2
1,15,2,14,1,3,2,4,3,28,3,1,1,1,1,1,0,1,0,1,0,0,0,1,1
1,24,2,13,2,3,2,2,3,32,3,1,1,1,1,0,0,1,0,0,1,0,1,0,2
4,24,4,24,5,5,3,4,2,53,3,2,1,1,1,0,0,1,0,0,1,0,0,1,1
1,30,0,81,5,2,3,3,3,25,1,3,1,1,1,0,0,1,0,0,1,0,0,1,1
2,24,2,126,1,5,2,2,4,44,3,1,1,2,1,0,1,1,0,0,0,0,0,0,2
4,24,2,34,3,5,3,2,3,31,3,1,2,2,1,0,0,1,0,0,1,0,0,1,1
4,9,4,21,1,3,3,4,3,48,3,3,1,2,1,1,0,1,0,0,1,0,0,1,1
1,6,2,26,3,3,3,3,1,44,3,1,2,1,1,0,0,1,0,1,0,0,0,1,1
1,10,4,22,1,2,3,3,1,48,3,2,2,1,2,1,0,1,0,1,0,0,1,0,1
2,12,4,18,2,2,3,4,2,44,3,1,1,1,1,0,1,1,0,0,1,0,0,1,1
4,10,4,21,5,3,4,1,3,26,3,2,1,1,2,0,0,1,0,0,1,0,0,1,1
1,6,2,14,1,3,3,2,1,36,1,1,1,2,1,0,0,1,0,0,1,0,1,0,1
4,6,0,4,1,5,4,4,3,39,3,1,1,1,1,0,0,1,0,0,1,0,1,0,1
3,12,1,4,4,3,2,3,1,42,3,2,1,1,1,0,0,1,0,1,0,0,0,1,1
2,7,2,24,1,3,3,2,1,34,3,1,1,1,1,0,0,0,0,0,1,0,0,1,1
1,60,3,68,1,5,3,4,4,63,3,2,1,2,1,0,0,1,0,0,1,0,0,1,2
2,18,2,19,4,2,4,3,1,36,1,1,1,2,1,0,0,1,0,0,1,0,0,1,1
1,24,2,40,1,3,3,2,3,27,2,1,1,1,1,0,0,1,0,0,1,0,0,1,1
2,18,2,59,2,3,3,2,3,30,3,2,1,2,1,1,0,1,0,0,1,0,0,1,1
4,12,4,13,5,5,3,4,4,57,3,1,1,1,1,0,0,1,0,1,0,0,1,0,1
3,12,2,15,1,2,2,1,2,33,1,1,1,2,1,0,0,1,0,0,1,0,0,0,1
2,45,4,47,1,2,3,2,2,25,3,2,1,1,1,0,0,1,0,0,1,0,1,0,2
4,48,4,61,1,3,3,3,4,31,1,1,1,2,1,0,0,1,0,0,0,0,0,1,1

The full dataset can be found here

I would like to split the data into 90% for training and 10% for testing. However, each split must preserve the class proportions (for example, in both the training and the test split, 70% of the samples must be of class 1 and 30% of class 2).

I know how to simply divide the data into train and test, but I don't know how to make this division respect the class distribution described above. How can I do that in Python?
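
For reference, the target counts implied by the question can be worked out directly. The sketch below only does the arithmetic for the 700/300 distribution and a 10% test share; the variable names are illustrative and not part of the original post.

```python
# Class counts reported by np.unique in the question
class_counts = {1: 700, 2: 300}
test_fraction = 0.10  # 10% of each class goes to the test set

total = sum(class_counts.values())
expected = {}
for label, count in class_counts.items():
    n_test = round(count * test_fraction)
    n_train = count - n_test
    expected[label] = (n_train, n_test)
    print(f"class {label}: {n_train} train, {n_test} test "
          f"({count / total:.0%} of the data)")
```

A stratified split should therefore put about 630 + 270 rows in the training set and 70 + 30 rows in the test set.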

I think in your last edit it is not as clear that you want to repeat the train/test split 10 times. It seemed clearer before — yatu, Apr 28 '20 at 08:30
Just use scikit-learn's `train_test_split` with argument `stratify=y` (where `y` is your class variable) https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html — desertnaut, Apr 28 '20 at 10:03
Looks like the OP wants to repeat this n times, though. From the unedited question: `this should be repeated 10 times`. So my guess is that the OP wants to perform some sort of k-fold cross-validation here, and `RepeatedStratifiedKFold` seems like a good fit for these requirements. `train_test_split` on its own would not allow that @desertnaut — yatu, Apr 28 '20 at 10:57
Did I understand correctly @mad ? Do you want to repeat train/test split 10 times? — yatu, Apr 28 '20 at 10:58
The repetition is not that important. What I want is that, for each split, the class proportions are respected. For example, in the training set, 70% (or something close) of the samples are from one class, and 30% (or something close) are from the other. — mad, Apr 28 '20 at 11:01
In that case you can just use `train_test_split` as desertnaut suggests. The whole point of using `RepeatedStratifiedKFold` was to answer the question as initially posed, that is, to repeat the splitting n times — yatu, Apr 28 '20 at 11:02
For future reference, please specify what you want more precisely from the beginning to avoid confusion. If `I want to make 10 runs of a machine learning algorithm in a given dataset with the following distribution` is not relevant to the solution, then the question would be clearer without it — yatu, Apr 28 '20 at 11:05
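
If the goal discussed in these comments is ten independent 90/10 splits rather than a k-fold scheme, scikit-learn's `StratifiedShuffleSplit` is another option. It is not mentioned in the thread, so treat this as a swapped-in alternative. The sketch below uses synthetic data (an assumption: 1000 rows, 24 random features, 700/300 labels) in place of the real dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Synthetic stand-in for the real dataset (an assumption for illustration):
# 1000 rows, 24 random features, 700 labels of class 1 and 300 of class 2.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1] * 700 + [2] * 300)

# 10 independent random splits, each 90% train / 10% test, stratified on y
sss = StratifiedShuffleSplit(n_splits=10, test_size=0.10, random_state=0)

train_shares = []
for train_index, test_index in sss.split(X, y):
    y_train, y_test = y[train_index], y[test_index]
    train_shares.append((y_train == 1).mean())
    print(f"train: {len(y_train)} rows, class 1 = {(y_train == 1).mean():.0%} | "
          f"test: {len(y_test)} rows, class 1 = {(y_test == 1).mean():.0%}")
```

Unlike k-fold, the test sets of different shuffle splits may overlap, which is usually acceptable when each run is evaluated on its own.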

yatu · Accepted Answer · 2020-04-28T08:57:07.677

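You could use RepeatedStratifiedKFold, which, as its name suggests, repeats a stratified K-Fold cross-validator n times. The output shown after the code adds up to 37 rows per split, which matches the 37-row sample in the question rather than the full dataset. To repeat the process 10 times, set n_repeats=10, and to get an approximately 9:1 train/test ratio, set n_splits=10: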

from sklearn.model_selection import RepeatedStratifiedKFold

# `a` is the dataset as a 2-D NumPy array; the last column holds the class label
X = a[:, :-1]
y = a[:, -1]

# 10 stratified folds (about 90% train / 10% test each), repeated 10 times
rskf = RepeatedStratifiedKFold(n_splits=10, n_repeats=10, random_state=2)

for train_index, test_index in rskf.split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
    # Share of class 1 in this training split
    print(f'\nClass 1: {((y_train==1).sum()/len(y_train))*100:.0f}%')
    print(f'\nShape of train: {X_train.shape[0]}')
    print(f'Shape of test: {X_test.shape[0]}')

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4

Class 1: 73%

Shape of train: 33
Shape of test: 4
...

Thanks for your answer. Will the command respect the proportion between classes? I mean, in each training/testing split, will I have 70% of one class and 30% of the other? — mad, Apr 28 '20 at 10:59
Yes, check out the print statements. This is basically what `StratifiedKFold` does @mad. From the docs: `The folds are made by preserving the percentage of samples for each class.` — yatu, Apr 28 '20 at 11:01
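
One way to check the claim quoted from the docs is to count the classes in every test fold, using the same `np.unique(..., return_counts=True)` call as in the question. This is a minimal sketch on synthetic labels (an assumption: 700 of class 1 and 300 of class 2, with placeholder features), not the original data.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Synthetic labels mirroring the distribution in the question (an assumption)
y = np.array([1] * 700 + [2] * 300)
X = np.zeros((len(y), 1))  # placeholder features; only y drives stratification

skf = StratifiedKFold(n_splits=10, shuffle=True, random_state=2)

fold_shares = []
for fold, (train_index, test_index) in enumerate(skf.split(X, y), start=1):
    # Count how many samples of each class landed in this test fold
    classes, counts = np.unique(y[test_index], return_counts=True)
    share = counts / counts.sum()
    fold_shares.append(share[0])  # classes are sorted, so index 0 is class 1
    print(f"fold {fold}: classes={classes.tolist()}, "
          f"test counts={counts.tolist()}, shares={share.round(2).tolist()}")
```

Each test fold should show roughly the same 70/30 proportion as the full label vector.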

score 0 · Answer 2 · answered Apr 28 '20 at 09:10

A well-known way to split data into train and test sets is scikit-learn's train_test_split.

See the API documentation for model_selection.train_test_split.

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.10, random_state=42)

You could play with the random_state variable (a seed) until the proportion between classes is correct. Without the stratify argument, train_test_split does not enforce the proportions, but the split generally follows the proportions in the population.

rajah9

`train_test_split` *will* enforce the proportions with the `stratify` argument https://scikit-learn.org/stable/modules/generated/sklearn.model_selection.train_test_split.html – desertnaut Apr 28 '20 at 10:29
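
A minimal sketch of the `stratify` argument mentioned in the comment above, assuming synthetic data (1000 rows, 24 random features, 700/300 labels) in place of the real dataset:

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the real dataset (an assumption for illustration)
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 24))
y = np.array([1] * 700 + [2] * 300)

# stratify=y asks for the class proportions of y to be kept in both parts
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.10, stratify=y, random_state=42
)

for name, labels in [("train", y_train), ("test", y_test)]:
    classes, counts = np.unique(labels, return_counts=True)
    print(name, dict(zip(classes.tolist(), counts.tolist())))
```

With `stratify=y` there is no need to search over `random_state` values; the seed only changes which rows land in each part, not the class proportions.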
