
I have a dataset (NumPy arrays) with 50 classes and 9000 training examples.

x_train=(9000,2048)
y_train=(9000,)  # Classes are strings 
classes=list(set(y_train))

I would like to build a sub-dataset such that each class has 5 examples,

which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:

sub_train_data=(250,2048)
sub_train_labels=(250,)

Remark: we take 5 examples at random from each class (total number of classes = 50).

Thank you

Joseph
  • Sounds good. What keeps you from doing that? – MB-F Jan 24 '18 at 14:53
  • I want to do that to get an estimate of how many examples are needed to reach top accuracy. I would like to start with 5 examples per class, then 10, 20, 40, 80, 160, 320, ... and plot the accuracy. Once the accuracy stops improving, I stop labeling data (a sketch follows these comments). – Joseph Jan 24 '18 at 14:56
  • Any idea @kazemakase? – Joseph Jan 24 '18 at 15:06
  • No, because I have no idea where the problem is. Did you try anything yet? Where did you get stuck? What is the actual question? – MB-F Jan 24 '18 at 15:15
  • @kazemakase, here is an answer to the question – Joseph Jan 24 '18 at 16:45
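A minimal sketch of that learning-curve loop, assuming the balanced_sample_maker helper from the answer below; the LogisticRegression classifier and the 80/20 validation split are placeholders I am adding for illustration, not part of the original question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sizes = [5, 10, 20, 40, 80, 160, 320]
scores = []
for n_per_class in sizes:
    # balanced_sample_maker is defined in the accepted answer below
    sub_x, sub_y = balanced_sample_maker(x_train, y_train, n_per_class)
    tr_x, va_x, tr_y, va_y = train_test_split(sub_x, sub_y, test_size=0.2, stratify=sub_y)
    clf = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    scores.append(clf.score(va_x, va_y))
# plot sizes vs. scores and stop labeling once the curve flattens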

4 Answers


Here is a solution to that problem:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    """Return a random subsample of X and y with `sample_size` examples per class."""
    uniq_levels = np.unique(y)

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for level in uniq_levels:
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # sample observations of each label (with replacement, so this also works
    # as oversampling when a class has fewer than `sample_size` examples)
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled examples:', sample_size * len(uniq_levels),
              ', samples per class:', sample_size,
              ', #classes:', len(uniq_levels))
    else:
        print('number of samples is wrong')

    # sanity check: every class should appear the same number of times
    labels, values = zip(*Counter(labels_train).items())
    print('number of classes:', len(set(labels_train)))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good, all classes have the same number of examples')
    else:
        print('Repeat your sampling, your classes are not balanced')

    # bar plot of the per-class counts
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)

Inspired by Scikit-learn balanced subsampling.

Joseph
  • Excellent. Are you sure you want to use `replace=True`? This means the same data point can occur in the subsample more than once. – MB-F Jan 25 '18 at 06:54
  • Yes, in case I have only one example in a given class. – Joseph Jan 25 '18 at 12:24
  • But in reality I use False. I use True mainly when I want to drastically increase the size of the dataset. – Joseph Jan 25 '18 at 12:25
  • How can I do the same when I have to take a different number of samples from each class (or np.unique(y))? – Fasty Oct 03 '19 at 05:42

Pure numpy solution:

import numpy as np

def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        # indices of all rows belonging to this class
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
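
The question also needs the matching labels; a possible variant (an extension sketched here, not part of the original answer) that keeps the labels aligned with the sampled rows:

import numpy as np

def sample_with_labels(X, y, n_samples):
    X_parts, y_parts = [], []
    for unique_y in np.unique(y):
        # indices of all rows belonging to this class
        val_indices = np.argwhere(y == unique_y).flatten()
        chosen = np.random.choice(val_indices, n_samples, replace=False)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# e.g. sub_train_data, sub_train_labels = sample_with_labels(x_train, y_train, 5)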
Kani

I usually use a trick from scikit-learn for this. I use the StratifiedShuffleSplit function. So if I have to select 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) as 1-1/n. Here is an example where I use only 1/10 of my data.

from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
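
If you want a specific number of examples per class rather than a fraction, one option (a sketch, not part of the original answer; it yields exactly 5 per class only when the original classes are themselves balanced) is to derive the split size from the desired count:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

n_per_class = 5                      # desired examples per class
n_classes = len(np.unique(y_train))  # 50 classes in the question
keep = n_per_class * n_classes       # 250 examples to keep

sp = StratifiedShuffleSplit(n_splits=1, train_size=keep, random_state=42)
for train_index, _ in sp.split(x_train, y_train):
    sub_x, sub_y = x_train[train_index], y_train[train_index]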

You can use a dataframe as input (as in my case) and use the simple code below:

import pandas as pd

col = 'target'  # name of the column that holds the class labels
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)

col is the name of your column with the classes. The code finds the size of the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.

Then you can convert the result back to a np.array.
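
To get back to the arrays the question asks for, a possible conversion (assuming the features are all remaining columns of the dataframe, which is an assumption about its layout):

# labels come from the target column, features from everything else
sub_train_labels = res[col].to_numpy()
sub_train_data = res.drop(columns=[col]).to_numpy()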

Sergey Zaitsev