
I need to train my network on data that follows a normal distribution. I've noticed that my neural net has a strong tendency to predict only the most frequent class label (comparing its predictions with the actual labels in a CSV file I exported).

What are some suggestions (other than cleaning the data to produce an evenly distributed training set) that would help my neural net avoid predicting only the most frequent label?

UPDATE: Just wanted to mention that the suggestions made in the comments section did indeed work. I also found that adding an extra layer to my NN mitigated the problem.

  • Are you using your own NN code or an external package? – Tomer Levinboim Apr 03 '16 at 02:10
  • I'm using my own NN code. It has about 1,000 input nodes, a 100-node hidden layer, and a 10-node output layer. It's a sigmoid NN. – Mostafa Zamani Apr 03 '16 at 02:14
  • (1) What is the training data class distribution? in particular, how frequent is the most frequent class? (2) If you do train on an evenly distributed training set, does this problem diminish? – Tomer Levinboim Apr 03 '16 at 02:34
  • One label constitutes about 50% of the labels. I did test my code on MNIST and it had above 98 percent accuracy; however, I did not test it on an evenly distributed version of my data, simply because I lack such data. What do you think? – Mostafa Zamani Apr 03 '16 at 02:38
  • There is this related question: http://stackoverflow.com/questions/33132251/is-it-important-for-a-neural-network-to-have-normally-distributed-data – Mostafa Zamani Apr 03 '16 at 02:58
  • Are you using mini-batches? If so, you could simulate evenly distributed training data by making sure each mini-batch is evenly distributed. – Tomer Levinboim Apr 03 '16 at 03:01
  • Yeah, I'm using mini-batches; that's actually a good suggestion, but hard to implement. – Mostafa Zamani Apr 03 '16 at 03:03
  • What makes it hard to implement? (which language are you coding in?) – Tomer Levinboim Apr 03 '16 at 03:20
  • I'm using python, and I have the data in numpy. I guess I can use numpy clip. – Mostafa Zamani Apr 03 '16 at 03:22
  • I don't see how clip() could help you here. But it really should not be difficult (say, if you set up "class number"=>"list of samples belonging to that class" dictionary) – Tomer Levinboim Apr 03 '16 at 03:29

1 Answer


Assuming the NN is trained using mini-batches, it is possible to simulate (rather than generate) evenly distributed training data by making sure each mini-batch is evenly distributed.

For example, in a 3-class classification problem with a mini-batch size of 30, construct each mini-batch by randomly selecting 10 samples per class (with repetition, if necessary).
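A minimal NumPy sketch of this idea, using the "class number => list of samples belonging to that class" dictionary suggested in the comments. It assumes the labels are in an integer array `y`; the helper name `balanced_minibatch` is illustrative, not from the asker's code.

```python
import numpy as np

def balanced_minibatch(X, y, per_class, rng=None):
    """Draw an evenly distributed mini-batch: `per_class` samples
    from each class, sampling with replacement so that rare classes
    can still fill their quota."""
    if rng is None:
        rng = np.random.default_rng()
    # Map each class label to the indices of its samples.
    by_class = {c: np.flatnonzero(y == c) for c in np.unique(y)}
    idx = np.concatenate([
        rng.choice(indices, size=per_class, replace=True)
        for indices in by_class.values()
    ])
    rng.shuffle(idx)  # avoid batches ordered by class
    return X[idx], y[idx]

# Example: 3 classes, batch size 30 -> 10 samples per class,
# even though class 2 has only 5 samples in the training set.
X = np.arange(100).reshape(50, 2)
y = np.array([0] * 30 + [1] * 15 + [2] * 5)  # imbalanced labels
Xb, yb = balanced_minibatch(X, y, per_class=10)
```

Each call yields a batch in which every class contributes exactly `per_class` samples, so the gradient updates no longer favor the majority class.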

Tomer Levinboim