I am not familiar with deep learning, so this might be a beginner question. In my understanding, the softmax function in multi-layer perceptrons is responsible for normalization, distributing a probability to each class. If so, why don't we use simple normalization instead?
Let's say we get a vector x = (10, 3, 2, 1).
Applying softmax, the output will be y = (0.9986, 0.0009, 0.0003, 0.0001).
Applying simple normalization (dividing each element by the sum, 16), the output will be y = (0.625, 0.1875, 0.125, 0.0625).
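For reference, here is a minimal NumPy sketch (my own, just to check the numbers) that reproduces both calculations:

```python
import numpy as np

x = np.array([10.0, 3.0, 2.0, 1.0])

# Softmax: exponentiate, then normalize.
# Shifting by the max first is a common trick to avoid numerical overflow;
# it does not change the result.
softmax = np.exp(x - x.max())
softmax /= softmax.sum()

# Simple normalization: divide each element by the sum of the vector.
simple = x / x.sum()

print(softmax)  # ~[0.9986, 0.0009, 0.0003, 0.0001]
print(simple)   # [0.625, 0.1875, 0.125, 0.0625]
```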
It seems like simple normalization can also produce a probability distribution. So, what is the advantage of using the softmax function on the output layer?