How to split data into balanced training and test sets in sklearn

Question

I am using sklearn for a multi-class classification task. I need to split all my data into train_set and test_set, randomly taking the same number of samples from each class. Currently I am using this function

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0)

but it gives an unbalanced dataset! Any suggestions?

if you still want to use `cross_validation.train_test_split` and you are on sklearn `0.17` you can balance training and test, check out my answer — Guiem Bosch, Feb 18 '16 at 07:50
On a side-note, for an unbalanced training set with [sklearn.ensemble.RandomForestClassifier](http://scikit-learn.org/stable/modules/generated/sklearn.ensemble.RandomForestClassifier.html#sklearn.ensemble.RandomForestClassifier) for example, `class_weight="balanced"` can be used. — Shadi, Feb 11 '18 at 04:04
@Shadi: Please note that balancing your train set is something different; `class_weight` will have an impact on your cost-minimization. — Markus, Nov 26 '21 at 15:55
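
As a rough illustration of the `class_weight="balanced"` option mentioned in the comments above, here is a minimal sketch on made-up unbalanced data. The arrays and variable names are assumptions for illustration only; `compute_class_weight` is used just to show which weights the "balanced" setting corresponds to.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Hypothetical unbalanced data: 90 samples of class 0, 10 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
y = np.array([0] * 90 + [1] * 10)

# "balanced" weights are n_samples / (n_classes * count_per_class).
weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)
print(weights)

# The data itself is left untouched; only the per-sample weighting
# used while fitting changes.
clf = RandomForestClassifier(class_weight="balanced", random_state=0).fit(X, y)
```

As the last comment points out, this reweights the classes during fitting; it does not change how many samples of each class end up in the train or test set.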

Guiem Bosch · Answer 1 · 2016-02-18T07:46:22.667

48

Although Christian's suggestion is correct, technically train_test_split should give you stratified results by using the stratify param.

So you could do:

X_train, X_test, y_train, y_test = cross_validation.train_test_split(Data, Target, test_size=0.3, random_state=0, stratify=Target)

The catch is that the stratify parameter is only available from sklearn version 0.17 onward.

From the documentation about the parameter stratify:

stratify : array-like or None (default is None) If not None, data is split in a stratified fashion, using this as the labels array. New in version 0.17: stratify splitting
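
In newer sklearn releases (0.18 and later), `train_test_split` is imported from `sklearn.model_selection` rather than `cross_validation`. A minimal sketch of the same call, assuming made-up data with a 70/30 class ratio (the arrays below are not from the question):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: 70 samples of class 0 and 30 of class 1.
Data = np.arange(200).reshape(100, 2)
Target = np.array([0] * 70 + [1] * 30)

X_train, X_test, y_train, y_test = train_test_split(
    Data, Target, test_size=0.3, random_state=0, stratify=Target
)

# Both parts keep the 70/30 class ratio of Target.
print(np.bincount(y_train))
print(np.bincount(y_test))
```

Note that this preserves the original class proportions in both parts; it does not equalize the class counts.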

edited Feb 18 '16 at 07:46

answered Feb 18 '16 at 06:57

Guiem Bosch

5

but if the classes are not balanced in Data (class1=200 samples, class2=250 samples, ...) and I need to take (100, 100) for training and (50, 50) for test, how can I do it? – Jeanne Feb 19 '16 at 02:23
1

there are two more parameters in the `train_test_split`: `train_size`, `test_size` (and those, apart from representing a proportion if `float`, they can also be `int`). Never tried it, but I think that `train_size=100`, `test_size=50` combined with the `stratify` param should work. – Guiem Bosch Feb 19 '16 at 04:12
2

I didn't try it, but if you do that, you should get 100 training samples that follow the original distribution and 50 test samples that follow the original distribution too. (I will change the example a little to clarify: suppose class1=200 samples, class2=400 samples.) Then your train set will have 33 examples from class1 and 67 from class2, and your test set will have about 17 examples from class1 and 33 from class2. As far as I understand, the original question is trying to get a train set with 50 examples from class1 and 50 from class2, but a test set with about 17 examples from class1 and 33 from class2. – Rodrigo Laguna Feb 06 '18 at 18:22
3

To clarify, a split using stratify creates samples of the data in the same proportions as the original, e.g. if the classes in your data are split 70/30, then a stratified split will create samples with a 70/30 split. – BenP Apr 04 '18 at 11:02
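
The point made in these comments can be checked with a small sketch. It assumes hypothetical data with 200 samples of class 1 and 400 of class 2, and uses the `sklearn.model_selection` import path of newer releases. Integer `train_size` and `test_size` fix the absolute sizes, but `stratify` still keeps the original 1:2 ratio rather than producing 50/50 classes.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical data: class 1 has 200 samples, class 2 has 400.
y = np.array([1] * 200 + [2] * 400)
X = np.arange(len(y)).reshape(-1, 1)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=100, test_size=50, stratify=y, random_state=0
)

# Sizes are absolute, but roughly one third of each part is class 1.
train_classes, train_counts = np.unique(y_train, return_counts=True)
test_classes, test_counts = np.unique(y_test, return_counts=True)
print(train_classes, train_counts)
print(test_classes, test_counts)
```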

score 33 · Accepted Answer · edited Aug 24 '20 at 08:59

You can use StratifiedShuffleSplit to create datasets featuring the same percentage of classes as the original one:

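import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

X = np.array([[1, 3], [3, 7], [2, 4], [4, 8]])
y = np.array([0, 1, 0, 1])
# In sklearn.model_selection the splitter takes n_splits (not the labels);
# the index arrays come from calling .split(X, y)
stratSplit = StratifiedShuffleSplit(n_splits=1, test_size=0.5, random_state=42)
for train_idx, test_idx in stratSplit.split(X, y):
    X_train, y_train = X[train_idx], y[train_idx]
    X_test, y_test = X[test_idx], y[test_idx]

print(X_train)  # two rows, one from each class (which rows depends on the seed)
print(y_train)  # one 0 and one 1, in some order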

edited Aug 24 '20 at 08:59

ptyshevs

answered Feb 18 '16 at 04:49

Christian Hirsch

8

Note from documentation: StratifiedShuffleSplit is deprecated since version 0.18: This module will be removed in 0.20. Use [sklearn.model_selection.StratifiedShuffleSplit](http://scikit-learn.org/stable/modules/generated/sklearn.model_selection.StratifiedShuffleSplit.html#sklearn.model_selection.StratifiedShuffleSplit) instead. – mc2 Nov 13 '17 at 08:31
"*to create datasets featuring the same percentage of classes as the original one:"* according to https://github.com/scikit-learn/scikit-learn/issues/8913 this is not always the case. – gented Nov 24 '17 at 10:45
Code is untested I suppose, as I get the error that stratSplit is not iterable. – Pfinnn Jun 08 '21 at 18:27
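
Regarding the comments above: with the `sklearn.model_selection` version, the splitter object itself is not iterable, and the index pairs come from its `.split(X, y)` method. The following sketch, on made-up labels, draws several splits and counts the classes in each, which is one way to check how closely the proportions are preserved for a given dataset.

```python
import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

# Hypothetical labels: 60 samples of class 0 and 40 of class 1.
y = np.array([0] * 60 + [1] * 40)
X = np.arange(len(y)).reshape(-1, 1)

# The splitter is configured first; indices are produced by .split(X, y).
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.25, random_state=0)

test_counts = []
for train_idx, test_idx in splitter.split(X, y):
    # Count how many test samples fall in each class for this split.
    test_counts.append(np.bincount(y[test_idx], minlength=2))
    print(np.bincount(y[train_idx]), test_counts[-1])
```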

antike · Answer 3 · 2017-12-13T09:38:42.533

If the classes are not balanced but you want the split to be balanced, then stratifying isn't going to help. There doesn't seem to be a method for doing balanced sampling in sklearn, but it is fairly easy with basic numpy. For example, a function like this might help you:

import numpy as np

def split_balanced(data, target, test_size=0.2):
    classes = np.unique(target)
    # test_size can be a fraction of the input data size or a number of samples
    if test_size < 1:
        n_test = int(np.round(len(target) * test_size))
    else:
        n_test = int(test_size)
    n_train = max(0, len(target) - n_test)
    n_train_per_class = max(1, int(np.floor(n_train / len(classes))))
    n_test_per_class = max(1, int(np.floor(n_test / len(classes))))

    ixs = []
    for cl in classes:
        if (n_train_per_class+n_test_per_class) > np.sum(target==cl):
            # if data has too few samples for this class, do upsampling
            # split the data to training and testing before sampling so data points won't be
            #  shared among training and test data
            splitix = int(np.ceil(n_train_per_class/(n_train_per_class+n_test_per_class)*np.sum(target==cl)))
            ixs.append(np.r_[np.random.choice(np.nonzero(target==cl)[0][:splitix], n_train_per_class),
                np.random.choice(np.nonzero(target==cl)[0][splitix:], n_test_per_class)])
        else:
            ixs.append(np.random.choice(np.nonzero(target==cl)[0], n_train_per_class+n_test_per_class,
                replace=False))

    # take same num of samples from all classes
    ix_train = np.concatenate([x[:n_train_per_class] for x in ixs])
    ix_test = np.concatenate([x[n_train_per_class:(n_train_per_class+n_test_per_class)] for x in ixs])

    X_train = data[ix_train,:]
    X_test = data[ix_test,:]
    y_train = target[ix_train]
    y_test = target[ix_test]

    return X_train, X_test, y_train, y_test

Note that if you use this and request more points per class than the input data contains, those classes will be upsampled (sampled with replacement). As a result, some data points will appear multiple times, which may affect accuracy measures and similar metrics. If a class has only one data point, there will be an error. You can easily check the number of points per class with, for example, np.unique(target, return_counts=True).
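
For the case where upsampling is not wanted, a simpler variant is sketched below. It is not the function above: `balanced_split_indices` is a hypothetical helper that takes explicit per-class counts, raises an error when a class is too small, and uses a seeded generator so the split is reproducible.

```python
import numpy as np

def balanced_split_indices(target, n_train_per_class, n_test_per_class, seed=None):
    """Hypothetical helper: draw equal, disjoint train/test index sets per class."""
    rng = np.random.default_rng(seed)
    needed = n_train_per_class + n_test_per_class
    train_idx, test_idx = [], []
    for cl in np.unique(target):
        cl_idx = np.flatnonzero(target == cl)
        if len(cl_idx) < needed:
            raise ValueError(f"class {cl!r} has only {len(cl_idx)} samples, need {needed}")
        # One draw without replacement, then cut in two,
        # so train and test indices never overlap.
        picked = rng.choice(cl_idx, size=needed, replace=False)
        train_idx.extend(picked[:n_train_per_class])
        test_idx.extend(picked[n_train_per_class:])
    return np.array(train_idx), np.array(test_idx)

# Made-up labels: 200 and 250 samples; take 100 per class for training, 50 for testing.
target = np.array([1] * 200 + [2] * 250)
train_idx, test_idx = balanced_split_indices(target, 100, 50, seed=0)
print(np.unique(target[train_idx], return_counts=True))
print(np.unique(target[test_idx], return_counts=True))
```

The returned index arrays can then be used to slice both the data and the labels, e.g. `data[train_idx]` and `target[train_idx]`.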

I like the principle, however I think there's a problem with the current implementation that the random sampling may assign identical samples to train and test sets. The sampling should probably collect train and test indices from separate pools. — DonSteep, Dec 09 '17 at 22:13
You're absolutely right and I tried to mention this by saying "you might have replicated points in your training and test data, which can cause your model performance look overly optimistic" but I now understand the wording might not have been perfect, sorry about that. I'll edit the code so that there won't be shared data points anymore. — antike, Dec 13 '17 at 09:30
I'm not sure whether your post is accurate. When you mention "balanced," do you mean that the proportion of each class is about equal? Or do you mean that the test set has about the same distribution of the classes that the train set has? Stratified sampling can achieve the latter. — JoAnn Alvarez, Sep 25 '20 at 16:53

score 1 · Answer 4 · answered Nov 09 '21 at 01:57

Another approach is to over- or under-sample from your stratified test/train split. The imbalanced-learn library is quite handy for this, and especially useful if you are doing online learning and want to guarantee balanced train data within your pipelines.

from imblearn.over_sampling import RandomOverSampler
from imblearn.pipeline import Pipeline as ImbalancePipeline
from sklearn.svm import SVC

model = ImbalancePipeline(steps=[
  ('data_balancer', RandomOverSampler()),
  ('classifier', SVC()),
])
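
If adding imbalanced-learn as a dependency is not an option, a similar effect can be approximated with plain sklearn. The sketch below is not the imbalanced-learn API: it uses `sklearn.utils.resample` on made-up data to under-sample the training part of a stratified split, leaving the test part at its original class ratio.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.utils import resample

# Hypothetical unbalanced data: 300 samples of class 0, 100 of class 1.
rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = np.array([0] * 300 + [1] * 100)

# Stratified split first, so the test set keeps the original class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# Under-sample every class in the training part down to the rarest class.
n_min = int(np.bincount(y_train).min())
parts = [
    resample(X_train[y_train == cl], y_train[y_train == cl],
             replace=False, n_samples=n_min, random_state=0)
    for cl in np.unique(y_train)
]
X_bal = np.concatenate([p[0] for p in parts])
y_bal = np.concatenate([p[1] for p in parts])
print(np.bincount(y_bal))
```

Only the training data is rebalanced here; the test set is left as drawn so that evaluation reflects the original class distribution.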

score 0 · Answer 5 · answered Dec 28 '17 at 23:24

This is my implementation that I use to get train/test data indexes

import numpy as np

def get_safe_balanced_split(target, trainSize=0.8, getTestIndexes=True, shuffle=False, seed=None):
    classes, counts = np.unique(target, return_counts=True)
    nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    if nPerClass > np.min(counts):
        print("Insufficient data to produce a balanced training data split.")
        print("Classes found %s"%classes)
        print("Classes count %s"%counts)
        ts = float(trainSize*np.min(counts)*len(classes)) / float(len(target))
        print("trainSize is reset from %s to %s"%(trainSize, ts))
        trainSize = ts
        nPerClass = float(len(target))*float(trainSize)/float(len(classes))
    # number of samples to draw per class (truncated to an integer)
    nPerClass = int(nPerClass)
    print("Data splitting on %i classes and returning %i per class"%(len(classes),nPerClass ))
    # get indexes
    trainIndexes = []
    for c in classes:
        if seed is not None:
            np.random.seed(seed)
        cIdxs = np.where(target==c)[0]
        cIdxs = np.random.choice(cIdxs, nPerClass, replace=False)
        trainIndexes.extend(cIdxs)
    # get test indexes
    testIndexes = None
    if getTestIndexes:
        testIndexes = list(set(range(len(target))) - set(trainIndexes))
    # shuffle
    if shuffle:
        # shuffle in place; shuffle() returns None, so the result must not be reassigned
        np.random.shuffle(trainIndexes)
        if testIndexes is not None:
            np.random.shuffle(testIndexes)
    # return indexes
    return trainIndexes, testIndexes

score 0 · Answer 6 · answered Nov 25 '22 at 16:53

This is the function I am using. You can adapt it and optimize it.

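import numpy as np

# Returns a test dataset that contains an equal number of samples from each class
# (the first samples_n of each class, in order; no shuffling is done).
# X and y should be numpy arrays; y should contain only the two classes 0 and 1
def TrainSplitEqualBinary(X, y, samples_n): #samples_n per class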
    
    indicesClass1 = []
    indicesClass2 = []
    
    for i in range(0, len(y)):
        if y[i] == 0 and len(indicesClass1) < samples_n:
            indicesClass1.append(i)
        elif y[i] == 1 and len(indicesClass2) < samples_n:
            indicesClass2.append(i)
            
        if len(indicesClass1) == samples_n and len(indicesClass2) == samples_n:
            break
    
    X_test_class1 = X[indicesClass1]
    X_test_class2 = X[indicesClass2]
    
    X_test = np.concatenate((X_test_class1,X_test_class2), axis=0)
    
    #remove x_test from X
    X_train = np.delete(X, indicesClass1 + indicesClass2, axis=0)
    
    Y_test_class1 = y[indicesClass1]
    Y_test_class2 = y[indicesClass2]
    
    y_test = np.concatenate((Y_test_class1,Y_test_class2), axis=0)
    
    #remove y_test from y
    y_train = np.delete(y, indicesClass1 + indicesClass2, axis=0)
    
    if (X_test.shape[0] != 2 * samples_n or y_test.shape[0] != 2 * samples_n):
        raise Exception("Problem with split 1!")
        
    if (X_train.shape[0] + X_test.shape[0] != X.shape[0] or y_train.shape[0] + y_test.shape[0] != y.shape[0]):
        raise Exception("Problem with split 2!")
    
    return X_train, X_test, y_train, y_test
