I am not familiar with deep learning, so this might be a beginner question. In my understanding, the softmax function in multi-layer perceptrons is responsible for normalization, distributing a probability to each class. If so, why don't we use simple normalization instead?
Let's say we get a vector x = (10, 3, 2, 1).
Applying softmax, the output will be y = (0.9986, 0.0009, 0.0003, 0.0001).
Applying simple normalization (dividing each element by the sum, 16), the output will be y = (0.625, 0.1875, 0.125, 0.0625).
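For reference, here is a minimal NumPy sketch (my own, just to check the numbers) that reproduces both calculations:

```python
import numpy as np

x = np.array([10.0, 3.0, 2.0, 1.0])

# Softmax: exponentiate, then normalize.
# Shifting by the max first is a common trick to avoid numerical overflow;
# it does not change the result.
softmax = np.exp(x - x.max())
softmax /= softmax.sum()

# Simple normalization: divide each element by the sum of the vector.
simple = x / x.sum()

print(softmax)  # ~[0.9986, 0.0009, 0.0003, 0.0001]
print(simple)   # [0.625, 0.1875, 0.125, 0.0625]
```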
It seems like simple normalization can also produce a probability distribution. So, what is the advantage of using the softmax function on the output layer?