
I would like to perform logistic regression using Vowpal Wabbit. How can I handle imbalanced classes (e.g. 1000/50000)? I know that I can use importance weighting, but I'm not sure it is the best option in this case. There are also algorithms like SMOTE, but I don't know how to use them with Vowpal Wabbit.

max04

1 Answer


Yes, importance weighting is the standard solution for imbalanced classes in Vowpal Wabbit. The most important question is what your final evaluation criterion is. Is it the Area Under the ROC Curve (AUC)? See Calculating AUC when using Vowpal Wabbit and How to perform logistic regression using vowpal wabbit on very imbalanced dataset (there, see both answers).
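As a minimal sketch of what this looks like in practice (the weight 50 below is just the raw 50000/1000 class ratio; treat it as a starting point and tune it, e.g. by cross-validation): in VW's input format, the importance weight is the optional second field after the label, so minority-class examples can be up-weighted while majority-class examples keep the default weight of 1:

```
1 50 | f1:0.5 f2:1.2
-1 | f1:0.3 f2:0.7
```

Training and prediction with logistic loss then look like:

```
# train a logistic model on the importance-weighted data
vw train.vw --loss_function logistic -f model.vw
# predict probabilities on held-out data
vw -t test.vw -i model.vw --link logistic -p predictions.txt
```

The feature names and file names above are made up for illustration.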

SMOTE seems to be a combination of over-sampling the minority class and under-sampling the majority class, where the over-sampling is done by generating synthetic examples that randomly interpolate between a minority example and e.g. its 5 nearest neighbors. This method is not implemented in Vowpal Wabbit, and it is not compatible with online learning (because of the nearest-neighbor search). It could probably be approximated in an online fashion somehow.
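If you still want to try SMOTE, one workaround is to do the resampling offline, before the data ever reaches VW. Below is a hedged sketch using the third-party imbalanced-learn package (not part of Vowpal Wabbit; the array shapes and file name are made up for illustration):

```python
# Offline SMOTE with imbalanced-learn, then export in VW input format.
import numpy as np
from imblearn.over_sampling import SMOTE

rng = np.random.default_rng(0)
X = rng.normal(size=(51000, 10))         # toy features
y = np.array([1] * 1000 + [-1] * 50000)  # 1000/50000 imbalance

# k_neighbors=5 matches the "5 nearest neighbors" mentioned above
X_res, y_res = SMOTE(k_neighbors=5, random_state=0).fit_resample(X, y)

# Write the balanced data as VW examples: "label | f0:v0 f1:v1 ..."
with open("train_smote.vw", "w") as f:
    for label, row in zip(y_res, X_res):
        feats = " ".join(f"f{i}:{v:.4f}" for i, v in enumerate(row))
        f.write(f"{label} | {feats}\n")
```

Note that this gives up the single-pass online workflow, which is exactly the trade-off discussed above.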

Martin Popel
  • Probably it will be the F1 score and AUC. I will also use a lift chart. So, is importance weighting the only option in the case of online learning? – max04 Nov 11 '15 at 15:36
  • The over-sampling (and under-sampling), if done correctly, should be very similar to importance weighting. In both approaches you need to find the optimal constant, e.g. with cross-validation. The generated synthetic examples should reduce [variance](http://en.wikipedia.org/wiki/Bias%E2%80%93variance_tradeoff). Bagging (`--bootstrap M`) can be used for the same purpose (see http://stackoverflow.com/questions/30008991/gradient-boosting-on-vowpal-wabbit/30035042#30035042). – Martin Popel Nov 11 '15 at 21:17