
I am working on a multi-label text classification problem (90 target labels in total). The data has a long-tailed distribution and around 1900k records. Currently, I am working on a small sample of around 100k records with a similar target distribution.

Some algorithms provide built-in functionality to handle class imbalance, e.g. PassiveAggressiveClassifier (PAC) and LinearSVC via class_weight='balanced'. Currently, I am also using SMOTE to generate samples for all classes except the majority, and RandomUnderSampler to suppress the imbalance coming from the majority class.

Is it right to use both the algorithm's class_weight parameter and the imblearn pipeline at the same time to handle class imbalance?

from sklearn.pipeline import FeatureUnion
from sklearn.linear_model import PassiveAggressiveClassifier, LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.ensemble import StackingClassifier
from sklearn.multiclass import OneVsRestClassifier
from imblearn.pipeline import Pipeline as imblearnPipeline
from imblearn.over_sampling import SMOTE
from imblearn.under_sampling import RandomUnderSampler

# text_pipeline is defined elsewhere
feat_pipeline = FeatureUnion([('text', text_pipeline)])

estimators_list = [
    ('PAC', PassiveAggressiveClassifier(max_iter=5000, random_state=0, class_weight='balanced')),
    ('linearSVC', LinearSVC(class_weight='balanced'))
]
estimators_ensemble = StackingClassifier(
    estimators=estimators_list,
    final_estimator=LogisticRegression(solver='lbfgs', max_iter=5000)
)
ovr_ensemble = OneVsRestClassifier(estimators_ensemble)

classifier_pipeline = imblearnPipeline([
    ('features', feat_pipeline),
    ('over_sampling', SMOTE(sampling_strategy='auto')),  # resample all classes but the majority class
    ('under_sampling', RandomUnderSampler(sampling_strategy='auto')),  # resample all classes but the minority class
    ('ovr_ensemble', ovr_ensemble)
])
desertnaut
joel
  • If your question is: "Should I use both ... ?" then the (easy) answer is: if it works, then do it... if it doesn't, try something else. If you're only using 100k of your 1900k samples, you have plenty for a test set. – TravisJ Apr 22 '20 at 17:39
  • @TravisJ in fact there is a reason not to do it, based on first principles; see answer below – desertnaut Apr 28 '20 at 11:42

1 Answer


Is it right to use both the algorithm parameter & imblearn pipelines at the same time to handle class imbalance?

Let's take a minute to think about what this may imply, and whether it actually makes sense.

Specific algorithms (or algorithm settings) for handling class imbalance naturally expect some actual imbalance in the data.

Now, if you have already artificially balanced your data (with SMOTE, majority-class undersampling, etc.), what your algorithms will face at the end of the day is a balanced dataset, not an imbalanced one. Needless to say, these algorithms have no way of "knowing" that this balance in the final data they see is artificial; so, from their point of view, there is no imbalance - hence no need for any special recipe to kick in.

So, it's not that it is wrong to do so; rather, in such a case these specific algorithms/settings will not be useful, in the sense that they will have nothing extra to offer regarding the handling of class imbalance.
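This is easy to verify with scikit-learn's compute_class_weight helper (a minimal sketch with made-up label counts): class_weight='balanced' derives per-class weights as n_samples / (n_classes * count(class)), so on an already-balanced set every weight collapses to 1.0 and the setting becomes a no-op.

```python
import numpy as np
from sklearn.utils.class_weight import compute_class_weight

# Imbalanced labels: 90 samples of class 0, 10 of class 1
y_imbalanced = np.array([0] * 90 + [1] * 10)
w = compute_class_weight(class_weight='balanced',
                         classes=np.unique(y_imbalanced), y=y_imbalanced)
print(w)  # [0.5555... 5.0] - the minority class gets a much larger weight

# After (artificial) balancing: 90 samples of each class
y_balanced = np.array([0] * 90 + [1] * 90)
w = compute_class_weight(class_weight='balanced',
                         classes=np.unique(y_balanced), y=y_balanced)
print(w)  # [1.0 1.0] - uniform weights, i.e. no special handling at all
```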

Quoting from an older answer of mine (completely different issue, but the general idea holds horizontally):

The field of deep neural nets is still (very) young, and it is true that it has yet to establish its "best practice" guidelines; add the fact that, thanks to an amazing community, there are all sorts of tools available in open-source implementations, and you can easily find yourself in the (admittedly tempting) position of mixing things up just because they happen to be available. I am not necessarily saying that this is what you are attempting to do here - I am just urging for more caution when combining ideas that may not have been designed to work together...

desertnaut