I am trying to balance my dataset, but I am struggling to find the right way to do it. Let me set up the problem. I have a multiclass dataset with the following class weights:
class    weight
2.0      0.700578
4.0      0.163401
3.0      0.126727
1.0      0.009294
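For context, these weights are just the normalized class frequencies of my label column, computed with something like:

label_all['LABEL_O_majority'].value_counts(normalize=True)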
As you can see, the dataset is pretty unbalanced. What I would like to do is obtain a balanced dataset in which each class is represented with the same weight.
There are a lot of questions regarding this, but:
- Scikit-learn balanced subsampling: these subsamples can overlap, which is wrong for my approach. Moreover, I would like to do this with sklearn or other well-tested packages.
- How to perform undersampling (the right way) with python scikit-learn?: here they suggest keeping the unbalanced dataset and using a balanced class-weight vector; however, I need the balanced dataset itself, it is not a matter of which model and which weights (a sketch of what I mean follows this list).
- https://github.com/scikit-learn-contrib/imbalanced-learn: a lot of questions refer to this package; my attempt with it is shown further below.
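To make it concrete, this is roughly the behaviour I am after, sketched here with RandomUnderSampler on toy data (the toy dataset and all names below are placeholders, not my real data, and I am not sure this is the intended tool):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy data with roughly the same imbalance as my dataset.
X, y = make_classification(n_samples=1199, n_classes=4, n_informative=4,
                           weights=[0.01, 0.70, 0.13, 0.16], random_state=42)

rus = RandomUnderSampler(random_state=42)  # default: undersample all but the minority class
X_bal, y_bal = rus.fit_resample(X, y)
print(Counter(y_bal))       # every class cut down to the minority count
print(rus.sample_indices_)  # rows kept from the original data, with no duplicates

This gives one balanced subset without overlap, but it throws away a huge amount of data, because everything is cut down to the size of class 1.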
Here is how I am trying to use the imbalanced-learn package:
from imblearn.ensemble import EasyEnsembleClassifier

# data_for holds the features; label_all holds the labels for those rows.
eec = EasyEnsembleClassifier(random_state=42, sampling_strategy='not minority',
                             n_estimators=2)
eec.fit(data_for, label_all.loc[data_for.index, 'LABEL_O_majority'])
new_data = eec.estimators_samples_  # hoping these are the undersampled indices
However, the returned indices are all the indices of the initial data, and they are simply repeated n_estimators times.
Here is the result:
[array([ 0, 1, 2, ..., 1196, 1197, 1198]),
array([ 0, 1, 2, ..., 1196, 1197, 1198])]
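If I understand the internals correctly (I have not verified this), estimators_samples_ comes from the outer bagging machinery, while the actual undersampling happens inside each fitted pipeline, so perhaps the per-estimator indices live on the sampler step instead; something like:

# Hypothetical: assumes each fitted estimator is a Pipeline whose
# first step is the internal RandomUnderSampler.
for pipe in eec.estimators_:
    print(pipe.steps[0][1].sample_indices_)

Even if that works, it is not obvious to me that these subsets are guaranteed not to overlap.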
Finally, a lot of techniques use oversampling, but I would like to avoid them. Only for class 1 can I tolerate oversampling, as it is very predictable.
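In other words, the compromise I could accept looks roughly like the following (the target count of 150 is made up, and I have not validated this end to end):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

target = 150  # hypothetical per-class size
# Undersample the three majority classes down to the target...
under = RandomUnderSampler(sampling_strategy={2.0: target, 3.0: target, 4.0: target},
                           random_state=42)
X_under, y_under = under.fit_resample(data_for,
                                      label_all.loc[data_for.index, 'LABEL_O_majority'])
# ...and oversample only class 1 up to the same target.
over = RandomOverSampler(sampling_strategy={1.0: target}, random_state=42)
X_bal, y_bal = over.fit_resample(X_under, y_under)

But chaining two samplers by hand feels clumsy, which brings me to my actual question.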
I am wondering whether sklearn, or this contrib package, really lacks a function that does this.