I am trying to balance my dataset, but I am struggling to find the right way to do it. Let me set up the problem. I have a multiclass dataset with the following class weights:
class    weight
2.0      0.700578
4.0      0.163401
3.0      0.126727
1.0      0.009294
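For context, these weights are just the normalized class frequencies of my label column, computed with something like:

label_all['LABEL_O_majority'].value_counts(normalize=True)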
As you can see, the dataset is pretty unbalanced. What I would like to do is obtain a balanced dataset in which each class is represented with the same weight.
There are a lot of questions regarding this, but:
- Scikit-learn balanced subsampling: these subsamples can overlap, which is wrong for my approach. Moreover, I would like to do this with sklearn or other well-tested packages.
- How to perform undersampling (the right way) with python scikit-learn?: here they suggest keeping the unbalanced dataset and using a balanced class-weight vector; however, I need the balanced dataset itself, it is not a matter of which model and which weights (a sketch of what I mean follows this list).
- https://github.com/scikit-learn-contrib/imbalanced-learn: a lot of questions refer to this package; my attempt with it is shown further below.
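To make it concrete, this is roughly the behaviour I am after, sketched here with RandomUnderSampler on toy data (the toy dataset and all names below are placeholders, not my real data, and I am not sure this is the intended tool):

from collections import Counter
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler

# Toy data with roughly the same imbalance as my dataset.
X, y = make_classification(n_samples=1199, n_classes=4, n_informative=4,
                           weights=[0.01, 0.70, 0.13, 0.16], random_state=42)

rus = RandomUnderSampler(random_state=42)  # default: undersample all but the minority class
X_bal, y_bal = rus.fit_resample(X, y)
print(Counter(y_bal))       # every class cut down to the minority count
print(rus.sample_indices_)  # rows kept from the original data, with no duplicates

This gives one balanced subset without overlap, but it throws away a huge amount of data, because everything is cut down to the size of class 1.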
Here is how I am trying to use the imbalanced-learn package:
from imblearn.ensemble import EasyEnsembleClassifier

# data_for holds the features; label_all holds the labels for those rows.
eec = EasyEnsembleClassifier(random_state=42, sampling_strategy='not minority',
                             n_estimators=2)
eec.fit(data_for, label_all.loc[data_for.index, 'LABEL_O_majority'])
new_data = eec.estimators_samples_  # hoping these are the undersampled indices
However, the returned indices are all the indices of the initial data, and they are simply repeated n_estimators times.
Here is the result:
[array([ 0, 1, 2, ..., 1196, 1197, 1198]),
array([ 0, 1, 2, ..., 1196, 1197, 1198])]
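If I understand the internals correctly (I have not verified this), estimators_samples_ comes from the outer bagging machinery, while the actual undersampling happens inside each fitted pipeline, so perhaps the per-estimator indices live on the sampler step instead; something like:

# Hypothetical: assumes each fitted estimator is a Pipeline whose
# first step is the internal RandomUnderSampler.
for pipe in eec.estimators_:
    print(pipe.steps[0][1].sample_indices_)

Even if that works, it is not obvious to me that these subsets are guaranteed not to overlap.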
Finally, a lot of techniques use oversampling, but I would like to avoid them. Only for class 1 can I tolerate oversampling, as it is very predictable.
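In other words, the compromise I could accept looks roughly like the following (the target count of 150 is made up, and I have not validated this end to end):

from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import RandomOverSampler

target = 150  # hypothetical per-class size
# Undersample the three majority classes down to the target...
under = RandomUnderSampler(sampling_strategy={2.0: target, 3.0: target, 4.0: target},
                           random_state=42)
X_under, y_under = under.fit_resample(data_for,
                                      label_all.loc[data_for.index, 'LABEL_O_majority'])
# ...and oversample only class 1 up to the same target.
over = RandomOverSampler(sampling_strategy={1.0: target}, random_state=42)
X_bal, y_bal = over.fit_resample(X_under, y_under)

But chaining two samplers by hand feels clumsy, which brings me to my actual question.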
I am wondering whether sklearn, or this contrib package, really lacks a function that does this.