Assume we have a multi-class classification task with 3 classes:
{Cheesecake, Ice Cream, Apple Pie}
Assume we have a trained neural network that classifies which of the three desserts a given chef prefers, and that the output layer consists of 3 neurons with softmax activation, so each neuron represents the probability of preferring the corresponding dessert.
For example, possible outputs of such network might be:
Output(chef_1) = { P(Cheesecake) = 0.3; P(Ice Cream) = 0.1; P(Apple Pie) = 0.6; }
Output(chef_2) = { P(Cheesecake) = 0.2; P(Ice Cream) = 0.1; P(Apple Pie) = 0.7; }
Output(chef_3) = { P(Cheesecake) = 0.1; P(Ice Cream) = 0.1; P(Apple Pie) = 0.8; }
In this case, all instances (chef_1, chef_2 and chef_3) are likely to prefer Apple Pie, but with different confidences (e.g. chef_3 is more likely to prefer Apple Pie than chef_1, since the network outputs probabilities of 0.8 and 0.6 respectively).
Now suppose we have a new dataset of 1000 chefs and we want to estimate the distribution of their favorite desserts. We would simply classify each of the 1000 chefs and determine their favorite dessert from the neuron with the maximum probability.
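In code, this step looks roughly like the following sketch (the Dirichlet draw is just a stand-in for the network's softmax outputs; in reality `probs` would come from something like `model.predict(chefs)`):

```python
import numpy as np

class_names = ["Cheesecake", "Ice Cream", "Apple Pie"]

# Stand-in for the network's softmax outputs on the 1000 chefs;
# one row per chef, each row sums to 1.
rng = np.random.default_rng(0)
probs = rng.dirichlet(alpha=[1, 1, 1], size=1000)  # shape (1000, 3)

predicted = probs.argmax(axis=1)              # index of the most probable dessert
counts = np.bincount(predicted, minlength=3)
distribution = counts / counts.sum()          # favorite-dessert distribution
print(dict(zip(class_names, distribution.round(3))))
```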
We also want to improve the prediction accuracy by discarding chefs whose maximum prediction probability is below 0.6. Let's assume that 200 of the 1000 chefs fell below this threshold and were discarded.
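With a single global cutoff, that filtering step is just the following (continuing the sketch above; 0.6 is the threshold from the text):

```python
max_prob = probs.max(axis=1)      # each chef's top predicted probability
keep = max_prob >= 0.6            # discard chefs predicted below 0.6

kept_predicted = predicted[keep]  # roughly the "800 chefs" of the example
kept_counts = np.bincount(kept_predicted, minlength=3)
kept_distribution = kept_counts / kept_counts.sum()
```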
In that case, we may bias the distribution over the remaining 800 chefs (those predicted with probability of at least 0.6) if one dessert is easier to predict than another.
For example, if the average prediction probabilities of the classes are:
AverageP(Cheesecake) = 0.9
AverageP(Ice Cream) = 0.5
AverageP(Apple Pie) = 0.8
and we discard chefs predicted with probability lower than 0.6, then among the 200 discarded chefs there are likely to be disproportionately many who prefer Ice Cream, resulting in a biased distribution among the other 800.
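A toy simulation makes this concrete (all numbers are made up to match the averages above, not taken from real data): assume an even 1/3-1/3-1/3 split of true favorites and draw each chef's confidence from a normal distribution around their class's average.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical per-class average confidences, as in the example above.
avg_conf = {"Cheesecake": 0.9, "Ice Cream": 0.5, "Apple Pie": 0.8}

# 333 chefs per class; confidences clipped to the valid range
# [1/3, 1] for a 3-class max-softmax score.
survivors = {}
for dessert, avg in avg_conf.items():
    conf = np.clip(rng.normal(avg, 0.1, size=333), 1 / 3, 1.0)
    survivors[dessert] = int((conf >= 0.6).sum())

print(survivors)
# Roughly {'Cheesecake': 333, 'Ice Cream': ~50, 'Apple Pie': ~325}:
# the even split is badly distorted after thresholding.
```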
Following this very long introduction (I am happy that you are still reading), my questions are:
Do we need a different threshold for each class? (e.g. among Cheesecake predictions discard instances whose probability is below X, among Ice Cream predictions discard instances whose probability is below Y, and among Apple Pie predictions discard instances whose probability is below Z).
If so, how can I calibrate these thresholds without distorting the overall distribution over my 1000-chef dataset (i.e. discard low-probability predictions to improve accuracy, while preserving the class distribution of the original dataset)?
I've tried using the average prediction probability of each class as its threshold, but I cannot guarantee that this preserves the distribution (these thresholds may overfit the test set on which they were computed and fail to transfer to the 1000-chef dataset).
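Roughly, that attempt looks like this (continuing the first sketch; in practice the averages would be estimated on a held-out set, not on the 1000 chefs themselves):

```python
# Per-class threshold = average confidence among instances predicted
# as that class (with the averages from the text these would be
# 0.9, 0.5 and 0.8).
thresholds = np.array([probs[predicted == c, c].mean() for c in range(3)])

# Keep a chef only if their top probability clears their own class's threshold.
keep = probs.max(axis=1) >= thresholds[predicted]
```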
Any suggestions or related papers?