
I am having trouble with a classification problem.

I have almost 400k vectors in my training data with two labels, and I'd like to train an MLP that classifies the data into two classes. However, the dataset is very imbalanced: 95% of the vectors have label 1, and the rest have label 0. The accuracy grows as training progresses and plateaus once it reaches 95%. I suspect this is because the network predicts label 1 for every vector.

So far, I have tried dropout with a probability of 0.5, but the result is the same. Are there any ways to improve the accuracy?

soshi shimada

3 Answers


I think the best way to deal with unbalanced data is to use class weights. For example, you can weight your classes so that the sum of the weights for each class is equal.

import pandas as pd

df = pd.DataFrame({'x': range(7),
                   'y': [0] * 2 + [1] * 5})

# Each sample is weighted inversely to its class frequency, so each
# class ends up with a total weight of len(df) / 2.
df['weight'] = df['y'].map(len(df) / 2 / df['y'].value_counts())

print(df)
# Named aggregation; passing a dict to .agg on a Series is no longer supported
print(df.groupby('y')['weight'].agg(samples='size', weight='sum'))

output:

   x  y  weight
0  0  0    1.75
1  1  0    1.75
2  2  1    0.70
3  3  1    0.70
4  4  1    0.70
5  5  1    0.70
6  6  1    0.70

   samples  weight
y                 
0        2     3.5
1        5     3.5
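
If you train the MLP with Keras, you don't even need the per-sample column: model.fit accepts a class_weight dict directly. Here is a minimal sketch, assuming X and y are NumPy arrays holding the training vectors and 0/1 labels; the layer sizes are illustrative, not from the question:

from tensorflow import keras

# Illustrative architecture; the class_weight argument is the point here
model = keras.Sequential([
    keras.layers.Dense(128, activation='relu', input_shape=(X.shape[1],)),
    keras.layers.Dropout(0.5),
    keras.layers.Dense(1, activation='sigmoid'),
])
model.compile(optimizer='adam', loss='binary_crossentropy',
              metrics=['accuracy'])

# Same rule as above: weight_c = n_samples / (2 * n_samples_in_class_c).
# With a 95/5 split this gives roughly {0: 10.0, 1: 0.53}.
class_weight = {0: len(y) / (2 * (y == 0).sum()),
                1: len(y) / (2 * (y == 1).sum())}
model.fit(X, y, epochs=10, batch_size=256, class_weight=class_weight)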
Alex Ozerov

You could try another classifier on a subset of the examples. SVMs can work well on small datasets, so you could take, say, only 10k examples with a 5:1 class ratio, as in the sketch below.
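
A minimal sketch of that subsampling with scikit-learn, assuming X and y are NumPy arrays of the 400k vectors and their 0/1 labels:

import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
# ~10k examples with a 5:1 ratio between the two classes
idx_1 = rng.choice(np.where(y == 1)[0], size=8_333, replace=False)
idx_0 = rng.choice(np.where(y == 0)[0], size=1_667, replace=False)
idx = np.concatenate([idx_1, idx_0])

svm = SVC()          # the default RBF kernel; tune as needed
svm.fit(X[idx], y[idx])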

You could also oversample the minority class and undersample the majority class; a resampling sketch follows.
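
For instance, with sklearn.utils.resample (again assuming NumPy arrays X and y; the 50k target size is arbitrary):

import numpy as np
from sklearn.utils import resample

X0, X1 = X[y == 0], X[y == 1]
# Oversample the minority class with replacement ...
X0_up = resample(X0, replace=True, n_samples=50_000, random_state=0)
# ... and undersample the majority class without replacement.
X1_down = resample(X1, replace=False, n_samples=50_000, random_state=0)

X_bal = np.vstack([X0_up, X1_down])
y_bal = np.array([0] * len(X0_up) + [1] * len(X1_down))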

You can also simply weight your classes.
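
In scikit-learn this is a single constructor argument, for example:

from sklearn.linear_model import LogisticRegression

# 'balanced' reweights samples inversely to class frequencies
clf = LogisticRegression(class_weight='balanced')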

Also think about a proper metric. It's good that you noticed that your model predicts only one label, because that failure is not easy to see from accuracy alone; per-class precision, recall, and F1 expose it immediately.
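
For example, with scikit-learn (y_test and y_pred assumed to be the true and predicted labels of a held-out set):

from sklearn.metrics import confusion_matrix, classification_report

# A constant predictor leaves a whole column of the confusion matrix at zero
print(confusion_matrix(y_test, y_pred))
# Per-class precision, recall, and F1 instead of a single accuracy number
print(classification_report(y_test, y_pred))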

Some nice ideas about unbalanced dataset here:

https://machinelearningmastery.com/tactics-to-combat-imbalanced-classes-in-your-machine-learning-dataset/

Remember not to change your test set.

DavidS1992

That's a common situation: the network learns a constant and can't get out of this local minimum.

When the data is very unbalanced, as in your case, one possible solution is a weighted cross-entropy loss function. For instance, in TensorFlow you can apply the built-in tf.nn.weighted_cross_entropy_with_logits function. There is also a good discussion of this idea in this post.
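
A minimal sketch of how it could be wired up as a Keras loss (TensorFlow 2 API; the class counts are the ones from the question, and the network's final layer is assumed to output raw logits, i.e. no sigmoid):

import tensorflow as tf

# pos_weight scales the loss term of the positive class (label 1).
# Here label 1 is over-represented, so pos_weight = n_neg / n_pos < 1
# balances the two classes (~0.053 for a 95/5 split of 400k samples).
pos_weight = 20_000 / 380_000

def weighted_bce(y_true, logits):
    return tf.reduce_mean(
        tf.nn.weighted_cross_entropy_with_logits(
            labels=y_true, logits=logits, pos_weight=pos_weight))

# e.g. model.compile(optimizer='adam', loss=weighted_bce)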

But I should say that getting more data to balance both classes (if that's possible) will always help.

Maxim