68

I'm trying to create N balanced random subsamples of my large unbalanced dataset. Is there a way to do this simply with scikit-learn / pandas, or do I have to implement it myself? Any pointers to code that does this?

These subsamples should be random and can overlap, as I feed each one to a separate classifier in a very large ensemble of classifiers.

In Weka there is a tool called SpreadSubsample; is there an equivalent in sklearn? http://wiki.pentaho.com/display/DATAMINING/SpreadSubsample

(I know about weighting but that's not what I'm looking for.)

mikkom
  • 3,521
  • 5
  • 25
  • 39
  • Do you want to just split your dataset into N equal-sized subsets, or do you really just want to perform cross-validation? See [`cross_validation`](http://scikit-learn.org/stable/modules/cross_validation.html) and specifically [`K-Fold`](http://scikit-learn.org/stable/modules/generated/sklearn.cross_validation.KFold.html) – EdChum May 04 '14 at 19:21
  • 1
    I know about the cross-validation functions; the problem is that the test size cannot be zero (they give an error). I'm using a huge ensemble (tens of thousands of classifiers), so it must be fast. It seems there is no such function, which is surprising, so I think I'll have to implement a custom one. – mikkom May 05 '14 at 05:31
  • 1
    FYI a sklearn-contrib package for learning on and dealing with imbalanced class data now exists https://github.com/scikit-learn-contrib/imbalanced-learn – eickenberg Nov 16 '17 at 00:37
  • 2
    @eickenberg, you should also post that comment as an answer; it's easier to find an answer than a comment, and I would say that using an already existing library is probably the best answer to my original question. – mikkom Nov 17 '17 at 10:34

14 Answers

39

There now exists a full-blown Python package to address imbalanced data. It is available as a sklearn-contrib package at https://github.com/scikit-learn-contrib/imbalanced-learn
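
For the original question of drawing N overlapping balanced subsamples, a minimal sketch (assuming imbalanced-learn 0.4 or later, where RandomUnderSampler and fit_resample are available; N, X and y are placeholders):

from imblearn.under_sampling import RandomUnderSampler

subsamples = []
for i in range(N):
    # each call undersamples the majority class(es) down to the minority class size,
    # with a different random draw per subsample, so the subsamples may overlap
    rus = RandomUnderSampler(random_state=i)
    X_bal, y_bal = rus.fit_resample(X, y)
    subsamples.append((X_bal, y_bal))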

eickenberg
  • 14,152
  • 1
  • 48
  • 52
30

Here is my first version, which seems to be working fine; feel free to copy it or make suggestions on how it could be more efficient (I have quite long experience with programming in general, but not that long with Python or NumPy).

This function creates a single random balanced subsample.

edit: With subsample_size < 1, the minority classes are downsampled as well; this should probably be changed.

import numpy as np

def balanced_subsample(x, y, subsample_size=1.0):

    # collect the rows of x belonging to each class and track the smallest class size
    class_xs = []
    min_elems = None

    for yi in np.unique(y):
        elems = x[(y == yi)]
        class_xs.append((yi, elems))
        if min_elems is None or elems.shape[0] < min_elems:
            min_elems = elems.shape[0]

    use_elems = min_elems
    if subsample_size < 1:
        use_elems = int(min_elems * subsample_size)

    xs = []
    ys = []

    # take use_elems random rows from each class (shuffle only when downsampling)
    for ci, this_xs in class_xs:
        if len(this_xs) > use_elems:
            np.random.shuffle(this_xs)

        x_ = this_xs[:use_elems]
        y_ = np.empty(use_elems)
        y_.fill(ci)

        xs.append(x_)
        ys.append(y_)

    xs = np.concatenate(xs)
    ys = np.concatenate(ys)

    return xs, ys
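
A minimal usage sketch (the data here is made up for illustration):

# imbalanced toy data: 90 samples of class 0, 10 of class 1
x = np.random.randn(100, 3)
y = np.array([0] * 90 + [1] * 10)

x_bal, y_bal = balanced_subsample(x, y)
print(np.unique(y_bal, return_counts=True))  # 10 samples of each class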

For anyone trying to make the above work with a Pandas DataFrame, you need to make a couple of changes:

  1. Replace the np.random.shuffle line with

    this_xs = this_xs.reindex(np.random.permutation(this_xs.index))

  2. Replace the np.concatenate lines with

    xs = pd.concat(xs)
    ys = pd.Series(data=np.concatenate(ys), name='target')

Charlie Haley
  • 4,152
  • 4
  • 22
  • 36
mikkom
  • 3,521
  • 5
  • 25
  • 39
  • How would you extend this to balancing a sample with custom classes i.e. not just 1 or 0, but let's say `"no_region"` and `"region"` (binary non-numeric classes) or even where x and y are multi-class? – Dhruv Ghulati Jul 04 '16 at 21:13
12

I found the best solutions here

And this is the one I think is the simplest.

dataset = pd.read_csv("data.csv")
X = dataset.iloc[:, 1:12].values
y = dataset.iloc[:, 12].values

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler(return_indices=True)
X_rus, y_rus, id_rus = rus.fit_sample(X, y)

Then you can use the X_rus, y_rus data.

For versions 0.4 and later:

from imblearn.under_sampling import RandomUnderSampler
rus = RandomUnderSampler()
X_rus, y_rus = rus.fit_sample(X, y)

Then, the indices of the randomly selected samples can be accessed via the sample_indices_ attribute.
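
For example (a small sketch, assuming the rus object above has already been fitted):

id_rus = rus.sample_indices_             # positions of the rows that were kept
X_check, y_check = X[id_rus], y[id_rus]  # should match X_rus, y_rus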

LinNotFound
  • 531
  • 1
  • 5
  • 12
9

A version for pandas Series:

import numpy as np

def balanced_subsample(y, size=None):

    subsample = []

    # per-class sample size: the minority class count, or `size` split evenly across classes
    if size is None:
        n_smp = y.value_counts().min()
    else:
        n_smp = int(size / len(y.value_counts().index))

    # draw n_smp index labels from each class without replacement
    for label in y.value_counts().index:
        samples = y[y == label].index.values
        index_range = range(samples.shape[0])
        indexes = np.random.choice(index_range, size=n_smp, replace=False)
        subsample += samples[indexes].tolist()

    return subsample
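
The function returns index labels, so (as a usage sketch, with a made-up DataFrame df and target column 'target') the balanced rows can be selected with .loc:

idx = balanced_subsample(df['target'])
df_balanced = df.loc[idx]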
gc5
  • 9,468
  • 24
  • 90
  • 151
5

This type of data splitting is not provided among the built-in data splitting techniques exposed in sklearn.cross_validation.

What seems similar to your needs is sklearn.cross_validation.StratifiedShuffleSplit, which can generate subsamples of any size while retaining the structure of the whole dataset, i.e. meticulously enforcing the same imbalance that is in your main dataset. While this is not what you are looking for, you may be able to use the code therein and always change the imposed ratio to 50/50.

(This would probably be a very good contribution to scikit-learn if you feel up to it.)
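
For reference, a minimal sketch of that stratified subsampling (using the newer sklearn.model_selection module; X and y are placeholders, and each train split keeps the original class proportions rather than balancing them):

from sklearn.model_selection import StratifiedShuffleSplit

# each train split is a random 10% subsample that preserves the class proportions of y
sss = StratifiedShuffleSplit(n_splits=5, train_size=0.1, random_state=0)
for train_idx, _ in sss.split(X, y):
    X_sub, y_sub = X[train_idx], y[train_idx]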

eickenberg
  • 14,152
  • 1
  • 48
  • 52
  • 1
    It should be very simple to implement, i.e. divide the data into classes, shuffle, and then just take the first N elements of each set. I'll see if I can contribute it easily after I have implemented it. – mikkom May 05 '14 at 15:47
  • I posted the first implementation as an answer. – mikkom May 05 '14 at 19:42
  • I'm not sure if this is still of interest to you, but while I'd agree that there isn't a dedicated function for this in `sklearn`, in [my answer below](http://stackoverflow.com/a/40798326/3540074) I suggested a way to use existing `sklearn` functions to equivalent effect. – kadu Nov 25 '16 at 05:36
  • OP wasn't looking for stratified methods, which *keep* the ratio of labels in folds. Your answer and mine do stratification. The difference is that in your choice the folds cannot overlap. This can be wanted in certain cases, but the OP explicitly permitted overlap here. – eickenberg Nov 25 '16 at 10:02
3

Below is my Python implementation for creating a balanced data copy. Assumptions: (1) the target variable (y) is binary (0 vs. 1); (2) 1 is the minority class.

from numpy import unique
from numpy import random 

def balanced_sample_maker(X, y, random_seed=None):
    """ return a balanced data set by oversampling the minority class;
        the current version is developed on the assumption that the positive
        class is the minority.

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        random.seed(random_seed)

    # find observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # oversample observations with the positive label up to the majority class size
    sample_size = uniq_counts[0]
    over_sample_idx = random.choice(groupby_levels[1], size=sample_size, replace=True).tolist()
    balanced_copy_idx = groupby_levels[0] + over_sample_idx
    random.shuffle(balanced_copy_idx)

    return X[balanced_copy_idx, :], y[balanced_copy_idx]
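
A quick usage sketch with made-up data (class 1 is the minority, as the function assumes):

import numpy as np

X = np.random.randn(100, 4)
y = np.array([0] * 90 + [1] * 10)

X_bal, y_bal = balanced_sample_maker(X, y, random_seed=42)
print(len(y_bal), int(sum(y_bal)))  # 180 samples, 90 of them class 1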
beingzy
  • 59
  • 1
  • 3
3

Here is a version of the above code that works for multiclass groups (in my tested case, groups 0, 1, 2, 3, 4):

import numpy as np
def balanced_sample_maker(X, y, sample_size, random_seed=None):
    """ return a balanced data set by sampling all classes
        with sample_size observations (with replacement).

    Parameters:
    ===========
    X: {numpy.ndarray}
    y: {numpy.ndarray}
    """
    uniq_levels = np.unique(y)
    uniq_counts = {level: sum(y == level) for level in uniq_levels}

    if random_seed is not None:
        np.random.seed(random_seed)

    # find observation indices of each class level
    groupby_levels = {}
    for ii, level in enumerate(uniq_levels):
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx
    # oversampling on observations of each label
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    return (X[balanced_copy_idx, :], y[balanced_copy_idx], balanced_copy_idx)

This also returns the indices, so they can be used for other datasets and to keep track of how frequently each data set was used (helpful for training).
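
A quick usage sketch with made-up multiclass data:

import numpy as np

y = np.random.choice([0, 1, 2, 3, 4], size=1000, p=[0.5, 0.2, 0.15, 0.1, 0.05])
X = np.random.randn(1000, 3)

X_bal, y_bal, idx = balanced_sample_maker(X, y, sample_size=50, random_seed=0)
print(np.unique(y_bal, return_counts=True))  # 50 samples per class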

Kevin Mader
  • 111
  • 1
  • 2
2

Simply select 100 rows in each class, with duplicates, using the following code. activity is my class column (the labels of the dataset).

balanced_df = Pdf_train.groupby('activity', as_index=False, group_keys=False).apply(lambda s: s.sample(100, replace=True))
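
A variation on the same idea, if you instead want to downsample every class to the size of the smallest one (no replacement needed; Pdf_train and activity are the names from above):

n_min = Pdf_train['activity'].value_counts().min()
balanced_df = Pdf_train.groupby('activity', group_keys=False).apply(lambda s: s.sample(n_min))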
javac
  • 2,819
  • 1
  • 20
  • 22
2

Here are my 2 cents. Assume that we have the following unbalanced dataset:

import pandas as pd
import numpy as np

df = pd.DataFrame({'Category': np.random.choice(['A','B','C'], size=1000, replace=True, p=[0.3, 0.5, 0.2]),
                   'Sentiment': np.random.choice([0,1], size=1000, replace=True, p=[0.35, 0.65]),
                   'Gender': np.random.choice(['M','F'], size=1000, replace=True, p=[0.70, 0.30])})
print(df.head())

The first rows:

  Category  Sentiment Gender
0        C          1      M
1        B          0      M
2        B          0      M
3        B          0      M
4        A          0      M

Assume now that we want to get a balanced dataset by Sentiment:

df_grouped_by = df.groupby(['Sentiment'])

df_balanced = df_grouped_by.apply(lambda x: x.sample(df_grouped_by.size().min()).reset_index(drop=True))

df_balanced = df_balanced.droplevel(['Sentiment'])
print(df_balanced.head())

The first rows of the balanced dataset:

  Category  Sentiment Gender
0        C          0      F
1        C          0      M
2        C          0      F
3        C          0      M
4        C          0      M

Let's verify that it is balanced in terms of Sentiment:

df_balanced.groupby(['Sentiment']).size()

We get:

Sentiment
0    369
1    369
dtype: int64

As we can see, we ended up with 369 positive and 369 negative Sentiment labels.

George Pipis
  • 1,452
  • 16
  • 12
1

A short, Pythonic solution to balance a pandas DataFrame either by subsampling (uspl=True) or oversampling (uspl=False), balanced by a specified column in that DataFrame that has two or more values.

For uspl=True, this code will take a random sample without replacement of size equal to the smallest stratum from all strata. For uspl=False, this code will take a random sample with replacement of size equal to the largest stratum from all strata.

def balanced_spl_by(df, lblcol, uspl=True):
    # one sub-frame per label value
    datas_l = [df[df[lblcol] == l].copy() for l in list(set(df[lblcol].values))]
    lsz = [f.shape[0] for f in datas_l]
    # downsample to the smallest stratum (uspl=True) or upsample with replacement
    # to the largest stratum (uspl=False), then shuffle the concatenated result
    return pd.concat([f.sample(n=(min(lsz) if uspl else max(lsz)), replace=(not uspl)).copy() for f in datas_l], axis=0).sample(frac=1)

This will only work with a Pandas DataFrame, but that seems to be a common application, and restricting it to Pandas DataFrames significantly shortens the code as far as I can tell.
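
A usage sketch (df and 'label' are placeholder names):

df_down = balanced_spl_by(df, 'label', uspl=True)   # subsample to the smallest class
df_up = balanced_spl_by(df, 'label', uspl=False)    # oversample to the largest class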

Roko Mijic
  • 6,655
  • 4
  • 29
  • 36
  • 1
    Exactly what I was hoping to find - using False perfectly upsampled instead of downsampling my dataframe. Thanks! – AlecZ Oct 01 '20 at 20:24
1

A slight modification to the top answer by mikkom.

If you want to preserve the ordering of the larger class's data, i.e. you don't want to shuffle:

Instead of

    if len(this_xs) > use_elems:
        np.random.shuffle(this_xs)

do this

        if len(this_xs) > use_elems:
            ratio = len(this_xs) // use_elems  # integer division so the slice step below is valid
            this_xs = this_xs[::ratio]
Bert Kellerman
  • 1,590
  • 10
  • 17
0

My subsampler version; I hope this helps.

import random

def subsample_indices(y, size):
    # map each target value to the list of positions where it occurs
    indices = {}
    target_values = set(y)
    for t in target_values:
        indices[t] = [i for i in range(len(y)) if y[i] == t]
    # cap every class at `size` or at the smallest class count, whichever is smaller
    min_len = min(size, min([len(indices[t]) for t in indices]))
    for t in indices:
        if len(indices[t]) > min_len:
            indices[t] = random.sample(indices[t], min_len)
    return indices

x = [1, 1, 1, 1, 1, -1, -1, -1, -1, -1, 1, 1, 1, -1]
j = subsample_indices(x, 2)
print(j)
print([x[t] for t in j[-1]])
print([x[t] for t in j[1]])
hernan
  • 31
  • 2
0

Here is my solution, which can be tightly integrated into an existing sklearn pipeline:

from sklearn.model_selection import RepeatedKFold
import numpy as np


class DownsampledRepeatedKFold(RepeatedKFold):

    def __init__(self, n_splits=5, n_repeats=10, random_state=None):
        self.n_splits = n_splits
        super(DownsampledRepeatedKFold, self).__init__(
            n_splits=n_splits, n_repeats=n_repeats, random_state=random_state
        )

    def split(self, X, y=None, groups=None):
        for i in range(self.n_repeats):
            np.random.seed()
            # get indices of the major class (negative)
            idxs_class0 = np.argwhere(y == 0).ravel()
            # get indices of the minor class (positive)
            idxs_class1 = np.argwhere(y == 1).ravel()
            # get length of minor class
            len_minor = len(idxs_class1)
            # subsample of the major class, the same size as the minor class
            # (np.random.choice samples with replacement by default)
            idxs_class0_downsampled = np.random.choice(idxs_class0, size=len_minor)
            original_indx_downsampled = np.hstack((idxs_class0_downsampled, idxs_class1))
            np.random.shuffle(original_indx_downsampled)
            splits = list(self.cv(n_splits=self.n_splits, shuffle=True).split(original_indx_downsampled))

            for train_index, test_index in splits:
                yield original_indx_downsampled[train_index], original_indx_downsampled[test_index]

Use it as usual:

for train_index, test_index in DownsampledRepeatedKFold(n_splits=5, n_repeats=10).split(X, y):
    X_train, X_test = X[train_index], X[test_index]
    y_train, y_test = y[train_index], y[test_index]
spaenigs
  • 152
  • 1
  • 10
0

Here's a solution which is:

  • simple (< 10 lines of code)
  • fast (aside from one for loop, pure NumPy)
  • no external dependencies other than NumPy
  • very cheap to generate new balanced random samples (just call np.random.choice()). Useful for generating different shuffled & balanced samples between training epochs

import numpy as np

def stratified_random_sample_weights(labels):
    # labels is a one-hot (num_samples, n_classes) array
    num_samples, n_classes = labels.shape
    sample_weights = np.zeros(num_samples)
    for class_i in range(n_classes):
        class_indices = np.where(labels[:, class_i] == 1)  # find indices where class_i is 1
        class_indices = np.squeeze(class_indices)  # get rid of extra dim
        num_samples_class_i = len(class_indices)
        assert num_samples_class_i > 0, f"No samples found for class index {class_i}"

        sample_weights[class_indices] = 1.0 / num_samples_class_i  # note: samples with no classes present will get weight=0

    return sample_weights / sample_weights.sum()  # sum(weights) == 1

Then, you re-use these weights over and over to generate balanced indices with np.random.choice():

sample_weights = stratified_random_sample_weights(labels)
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

Full example:

# generate data
import numpy as np
from sklearn.preprocessing import OneHotEncoder

num_samples = 10000
n_classes = 10
ground_truth_class_weights = np.logspace(1, 3, num=n_classes, base=10, dtype=float)  # exponentially growing
ground_truth_class_weights /= ground_truth_class_weights.sum()  # sum to 1
labels = np.random.choice(list(range(n_classes)), size=num_samples, p=ground_truth_class_weights)
labels = labels.reshape(-1, 1)  # reshape into a single-feature column for the encoder
labels = OneHotEncoder(sparse=False).fit_transform(labels)


print(f"original counts: {labels.sum(0)}")
# [  38.   76.  127.  191.  282.  556.  865. 1475. 2357. 4033.]

sample_weights = stratified_random_sample_weights(labels)
sample_size = 1000
chosen_indices = np.random.choice(list(range(num_samples)), size=sample_size, replace=True, p=sample_weights)

print(f"rebalanced counts: {labels[chosen_indices].sum(0)}")
# [104. 107.  88. 107.  94. 118.  92.  99. 100.  91.]
crypdick
  • 16,152
  • 7
  • 51
  • 74