
I have a very imbalanced dataset, where the majority class makes up 98% of the data and the minority class makes up 2%. I've dug into this and tried various methods of dealing with the imbalance, among them cost-sensitive learning: making the cost of misclassifying a minority-class (positive) example greater than the cost of misclassifying a majority-class (negative) example.
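
Concretely, the cost adjustment amounts to something like the following (a sketch on synthetic data using scikit-learn's `class_weight` parameter; the library and classifier are stand-ins, not my exact setup):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic data with a 98% / 2% class split, purely for illustration
X, y = make_classification(n_samples=10_000, weights=[0.98, 0.02], random_state=0)

# Penalize minority-class errors roughly 49x more than majority-class errors;
# class_weight='balanced' would derive similar weights from the class frequencies.
clf = LogisticRegression(class_weight={0: 1, 1: 49}, max_iter=1000)
clf.fit(X, y)
```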

That being said, I'm also trying methods that balance the dataset itself. I'm currently undersampling: randomly choosing n samples from the larger class so that the two classes are equal in a new dataset. When I fit models on this balanced dataset and evaluate them with cross-validation (using metrics such as ROC AUC and the Matthews correlation coefficient), the results are generally good. However, when I take the model fitted on the balanced dataset and evaluate it against the entire dataset, the results are terrible.
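
For reference, the workflow looks roughly like this (a sketch on synthetic data; the random forest and the scikit-learn calls are placeholders, not my exact setup):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.metrics import matthews_corrcoef, roc_auc_score

X, y = make_classification(n_samples=20_000, weights=[0.98, 0.02], random_state=0)

# Random undersampling: keep every minority sample, draw an equal number
# of majority samples at random, and build a balanced subset.
rng = np.random.RandomState(0)
minority_idx = np.where(y == 1)[0]
majority_idx = rng.choice(np.where(y == 0)[0], size=len(minority_idx), replace=False)
idx = np.concatenate([minority_idx, majority_idx])
X_bal, y_bal = X[idx], y[idx]

clf = RandomForestClassifier(random_state=0)

# Cross-validated score on the balanced subset (this is the number that looks good)
print(cross_val_score(clf, X_bal, y_bal, scoring='roc_auc', cv=5).mean())

# Fit on the balanced subset, then score against the full, imbalanced dataset
# (this is where the results fall apart)
clf.fit(X_bal, y_bal)
print(roc_auc_score(y, clf.predict_proba(X)[:, 1]))
print(matthews_corrcoef(y, clf.predict(X)))
```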

My question is: how should I interpret and compare the results obtained on the undersampled data?

  • Basically what is changing is your *prior probability* of obtaining a sample from the minority class. The estimates you make with the balanced classes seem too delicate to be able to survive in the new setting. Have you tried learning with less severe imbalance, such as 5:1, instead of learning only on balanced classes? Can you visualize where the problem lies, i.e. which samples are misclassified? – eickenberg Jun 09 '14 at 17:56
  • Also [the answer to this post](http://stackoverflow.com/questions/23455728/scikit-learn-balanced-subsampling/) seems to implement balanced subsampling, in case you are interested in trying somebody else's code against your own. – eickenberg Jun 09 '14 at 17:58
  • That makes perfect sense to me. I tried changing the balance to a less severe imbalance, and had much, much better results. I also used SMOTE (http://www.jair.org/media/953/live-953-2037-jair.pdf) to over-sample the minority class by creating new 'synthetic' samples, and that significantly improved results as well. – user3723025 Jun 10 '14 at 18:18
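
Following up on the comments above, here is a minimal sketch of both suggestions: undersampling to a milder 5:1 ratio instead of fully balancing, and oversampling the minority class with SMOTE. The imbalanced-learn library is an assumption here; the comments link to the SMOTE paper and a hand-rolled subsampler, not to this package.

```python
from sklearn.datasets import make_classification
from imblearn.under_sampling import RandomUnderSampler
from imblearn.over_sampling import SMOTE

X, y = make_classification(n_samples=20_000, weights=[0.98, 0.02], random_state=0)

# Undersample the majority class only down to a 5:1 ratio
# (sampling_strategy=0.2 means minority/majority = 0.2 after resampling)
X_5to1, y_5to1 = RandomUnderSampler(sampling_strategy=0.2, random_state=0).fit_resample(X, y)

# SMOTE: synthesize new minority samples by interpolating between nearest neighbours,
# here also up to a 5:1 ratio rather than full balance
X_smote, y_smote = SMOTE(sampling_strategy=0.2, random_state=0).fit_resample(X, y)
```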
