
I am trying to debug an issue with my classifier: it always predicts the same class for every input, despite reporting close to 80% accuracy.

I trained my CNN to detect the difference between 2 classes. class A has 2575 jpegs and class B has 665 jpegs.

Could this have caused my issue with my CNN always predicting the same class? Is this too much of an imbalance between the number of items in each class? In general, will my performance improve if I make both classes the same size (at 665 jpegs)?

Sreehari R

1 Answer


The problem seems to be a case of class imbalance, and there are several ways to handle it:

  1. Weighted loss: You can weight the loss so that mistakes on the minority class cost more, for example by computing a weighted cross-entropy.
  2. Resampling the data: As you mentioned, you can downsample the majority class to balance the classes, or you can instead upsample the minority class.
  3. Generate augmented data: Since you are working with images, you can upsample the minority class and then apply data augmentation to those images; this addresses the class imbalance and also tackles overfitting, improving generalisation.
  4. A combination of all of the above.
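Option 1 can be sketched with plain NumPy. The class counts below are taken from the question; the weighting scheme (inverse frequency, normalised to average 1) is one common choice, not the only one:

```python
import numpy as np

# Class counts from the question: 2575 (class A) vs 665 (class B).
counts = np.array([2575, 665])

# Inverse-frequency weights, normalised so they average to 1.
# The minority class gets a proportionally larger weight.
weights = counts.sum() / (len(counts) * counts)

def weighted_cross_entropy(probs, labels, class_weights):
    """Binary cross-entropy where each sample's loss is scaled by its class weight.

    probs  : predicted probability of class 1 for each sample
    labels : 0 (class A) or 1 (class B)
    """
    eps = 1e-12
    p = np.clip(probs, eps, 1 - eps)
    per_sample = -(labels * np.log(p) + (1 - labels) * np.log(1 - p))
    return np.mean(class_weights[labels] * per_sample)

# Toy usage: a mistake on the minority class (label 1) is penalised more heavily.
probs = np.array([0.1, 0.2])
labels = np.array([0, 1])
loss = weighted_cross_entropy(probs, labels, weights)
```

Most frameworks expose the same idea directly (e.g. a per-class weight argument on their cross-entropy loss), so in practice you would pass these weights to the built-in loss rather than writing your own.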
Vijay Mariappan
  • I like your answer @vijay. I wonder if (3.) can be harmful when data augmentation is only applied to underrepresented classes. Does the distortion in distribution pose a risk to the classifier's performance? – Gegenwind Mar 20 '18 at 08:14
  • @Gegenwind What I meant is: upsample the under-represented classes, and then apply data augmentation to all classes. – Vijay Mariappan Mar 20 '18 at 15:37
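The upsample-then-augment scheme described in the comments can be sketched as follows. Small random arrays stand in for real image batches, and a random horizontal flip stands in for a real augmentation pipeline; both are illustrative placeholders:

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the two classes (real arrays would be batches of images).
class_a = rng.random((20, 4, 4))   # majority class
class_b = rng.random((5, 4, 4))    # minority class

# Step 1: upsample the minority class by sampling with replacement
# until it matches the majority class in size.
idx = rng.integers(0, len(class_b), size=len(class_a))
class_b_up = class_b[idx]

def augment(batch, rng):
    """Cheap augmentation: random horizontal flips (last axis reversed)."""
    flips = rng.random(len(batch)) < 0.5
    out = batch.copy()
    out[flips] = out[flips][:, :, ::-1]
    return out

# Step 2: apply augmentation to *all* classes, so the augmented
# distribution is not distorted toward the minority class alone.
class_a_aug = augment(class_a, rng)
class_b_aug = augment(class_b_up, rng)
```

Because the upsampled minority images are duplicates, augmenting them afterwards is what makes the copies useful rather than redundant.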