
I have a dataset (NumPy arrays) with 50 classes and 9000 training examples.

x_train=(9000,2048)
y_train=(9000,)  # Classes are strings 
classes=list(set(y_train))

I would like to build a sub-dataset such that each class has 5 examples,

which means I get 5*50 = 250 training examples. Hence my sub-dataset will take this form:

sub_train_data=(250,2048)
sub_train_labels=(250,)

Remark: we take 5 examples at random from each class (total number of classes = 50).

Thank you

Joseph
  • Sounds good. What keeps you from doing that? – MB-F Jan 24 '18 at 14:53
  • I want to do that to get an estimate of how many examples are needed to reach top accuracy. I would like to start with 5 examples per class, then 10, 20, 40, 80, 160, 320, ... and plot the accuracy. Once the accuracy stops improving, I stop labeling data (a sketch follows these comments). – Joseph Jan 24 '18 at 14:56
  • Any idea @kazemakase? – Joseph Jan 24 '18 at 15:06
  • No, because I have no idea where the problem is. Did you try anything yet? Where did you get stuck? What is the actual question? – MB-F Jan 24 '18 at 15:15
  • @kazemakase, here is an answer to the question – Joseph Jan 24 '18 at 16:45
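A minimal sketch of that learning-curve loop, assuming the balanced_sample_maker helper from the answer below; the LogisticRegression classifier and the 80/20 validation split are placeholders I am adding for illustration, not part of the original question:

from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

sizes = [5, 10, 20, 40, 80, 160, 320]
scores = []
for n_per_class in sizes:
    # balanced_sample_maker is defined in the accepted answer below
    sub_x, sub_y = balanced_sample_maker(x_train, y_train, n_per_class)
    tr_x, va_x, tr_y, va_y = train_test_split(sub_x, sub_y, test_size=0.2, stratify=sub_y)
    clf = LogisticRegression(max_iter=1000).fit(tr_x, tr_y)
    scores.append(clf.score(va_x, va_y))
# plot sizes vs. scores and stop labeling once the curve flattens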

4 Answers


Here is a solution to that problem:

from collections import Counter
import numpy as np
import matplotlib.pyplot as plt

def balanced_sample_maker(X, y, sample_size, random_seed=42):
    """Return a random subsample of X and y with `sample_size` examples per class."""
    uniq_levels = np.unique(y)

    if random_seed is not None:
        np.random.seed(random_seed)

    # find the observation indices of each class level
    groupby_levels = {}
    for level in uniq_levels:
        obs_idx = [idx for idx, val in enumerate(y) if val == level]
        groupby_levels[level] = obs_idx

    # sample observations of each label (with replacement, so this also works
    # as oversampling when a class has fewer than `sample_size` examples)
    balanced_copy_idx = []
    for gb_level, gb_idx in groupby_levels.items():
        over_sample_idx = np.random.choice(gb_idx, size=sample_size, replace=True).tolist()
        balanced_copy_idx += over_sample_idx
    np.random.shuffle(balanced_copy_idx)

    data_train = X[balanced_copy_idx]
    labels_train = y[balanced_copy_idx]
    if len(data_train) == sample_size * len(uniq_levels):
        print('number of sampled examples:', sample_size * len(uniq_levels),
              ', samples per class:', sample_size,
              ', #classes:', len(uniq_levels))
    else:
        print('number of samples is wrong')

    # sanity check: every class should appear the same number of times
    labels, values = zip(*Counter(labels_train).items())
    print('number of classes:', len(set(labels_train)))
    check = all(x == values[0] for x in values)
    print(check)
    if check:
        print('Good, all classes have the same number of examples')
    else:
        print('Repeat your sampling, your classes are not balanced')

    # bar plot of the per-class counts
    indexes = np.arange(len(labels))
    width = 0.5
    plt.bar(indexes, values, width)
    plt.xticks(indexes + width * 0.5, labels)
    plt.show()
    return data_train, labels_train

X_train, y_train = balanced_sample_maker(X, y, 10)

Inspired by Scikit-learn balanced subsampling.

Joseph
  • Excellent. Are you sure you want to use `replace=True`? This means the same data point can occur in the subsample more than once. – MB-F Jan 25 '18 at 06:54
  • Yes, in case I have only one example in a given class. – Joseph Jan 25 '18 at 12:24
  • But in reality I use False. I use True mainly when I want to drastically increase the size of the dataset. – Joseph Jan 25 '18 at 12:25
  • How can I do the same when I have to take a different number of samples from each class (or np.unique(y))? – Fasty Oct 03 '19 at 05:42

Pure numpy solution:

import numpy as np

def sample(X, y, samples):
    unique_ys = np.unique(y, axis=0)
    result = []
    for unique_y in unique_ys:
        # indices of all rows belonging to this class
        val_indices = np.argwhere(y == unique_y).flatten()
        random_samples = np.random.choice(val_indices, samples, replace=False)
        result.append(X[random_samples])
    return np.concatenate(result)
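
The question also needs the matching labels; a possible variant (an extension sketched here, not part of the original answer) that keeps the labels aligned with the sampled rows:

import numpy as np

def sample_with_labels(X, y, n_samples):
    X_parts, y_parts = [], []
    for unique_y in np.unique(y):
        # indices of all rows belonging to this class
        val_indices = np.argwhere(y == unique_y).flatten()
        chosen = np.random.choice(val_indices, n_samples, replace=False)
        X_parts.append(X[chosen])
        y_parts.append(y[chosen])
    return np.concatenate(X_parts), np.concatenate(y_parts)

# e.g. sub_train_data, sub_train_labels = sample_with_labels(x_train, y_train, 5)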
Kani

I usually use a trick from scikit-learn for this. I use the StratifiedShuffleSplit function. So if I have to select 1/n fraction of my train set, I divide the data into n folds and set the proportion of test data (test_size) as 1-1/n. Here is an example where I use only 1/10 of my data.

from sklearn.model_selection import StratifiedShuffleSplit

sp = StratifiedShuffleSplit(n_splits=1, test_size=0.9, random_state=seed)
for train_index, _ in sp.split(x_train, y_train):
    x_train, y_train = x_train[train_index], y_train[train_index]
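
If you want a specific number of examples per class rather than a fraction, one option (a sketch, not part of the original answer; it yields exactly 5 per class only when the original classes are themselves balanced) is to derive the split size from the desired count:

import numpy as np
from sklearn.model_selection import StratifiedShuffleSplit

n_per_class = 5                      # desired examples per class
n_classes = len(np.unique(y_train))  # 50 classes in the question
keep = n_per_class * n_classes       # 250 examples to keep

sp = StratifiedShuffleSplit(n_splits=1, train_size=keep, random_state=42)
for train_index, _ in sp.split(x_train, y_train):
    sub_x, sub_y = x_train[train_index], y_train[train_index]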

You can use a dataframe as input (as in my case) and use the simple code below:

import pandas as pd

col = 'target'  # name of the column that holds the class labels
nsamples = min(t4m[col].value_counts().values)
res = pd.DataFrame()
for val in t4m[col].unique():
    t = t4m.loc[t4m[col] == val].sample(nsamples)
    res = pd.concat([res, t], ignore_index=True).sample(frac=1)

col is the name of your column with the classes. The code finds the size of the minority class, takes a sample of that size from each class, and shuffles the resulting dataframe.

Then you can convert the result back to a np.array.
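
To get back to the arrays the question asks for, a possible conversion (assuming the features are all remaining columns of the dataframe, which is an assumption about its layout):

# labels come from the target column, features from everything else
sub_train_labels = res[col].to_numpy()
sub_train_data = res.drop(columns=[col]).to_numpy()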

Sergey Zaitsev