I have a deep learning classification problem with 17 classes, and I am working in PyTorch. The architecture ends with a linear layer followed by a cross-entropy loss.

I believe that, normally, one computes a softmax activation and interprets it as probabilities for the corresponding output classes. But softmax is a monotonic function, so if I just want the most probable class, it seems I can simply choose the class with the maximum score after the linear layer and leave the softmax out.
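To make this concrete, here is a tiny made-up example (3 classes instead of 17) showing that the argmax is the same with or without softmax:

```python
import torch

# Made-up raw scores ("logits") from the final linear layer for 2 examples
logits = torch.tensor([[2.0, -1.0, 0.5],
                       [0.1,  3.0, 1.2]])

probs = torch.softmax(logits, dim=1)  # monotonic per-row transform

print(torch.argmax(logits, dim=1))    # tensor([0, 1])
print(torch.argmax(probs,  dim=1))    # tensor([0, 1]) -- same predicted classes
```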

Given that softmax is the standard, widely used activation in classification problems, am I missing something important here? Can anyone guide me?

Note that I have googled a large number of sites but, as far as I could tell, none of them answer this basic question (although they do provide a lot of related information).

Thanks

abby yorker
    [This](https://stackoverflow.com/questions/50986957/why-not-use-the-max-value-of-output-tensor-instead-of-softmax-function) and [this](https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization) might help. – Keyur Potdar May 12 '19 at 08:16
  • Thanks, these are useful. – abby yorker May 12 '19 at 19:11

1 Answer

You are right that you don't need softmax to predict the most probable class: you can indeed just take the class with the highest raw score.

However, you do need softmax at training time to calculate the loss function (cross-entropy), because cross-entropy is defined on probability distributions over classes. The softmax transform guarantees that the output of your network actually looks like a distribution: all scores are positive and they sum to 1. If they weren't positive, you could not calculate cross-entropy, because it involves logarithms. And if the scores didn't sum to one (or some other fixed constant), the model could drive the loss down by making all the scores arbitrarily large, without actually learning anything useful.
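In PyTorch specifically, `nn.CrossEntropyLoss` already combines log-softmax with the negative log-likelihood, so at training time you pass it the raw scores from the linear layer rather than adding an explicit softmax layer yourself. A minimal sketch with made-up sizes:

```python
import torch
import torch.nn as nn

num_classes = 17
model = nn.Linear(32, num_classes)   # final linear layer producing raw scores (logits)
loss_fn = nn.CrossEntropyLoss()      # applies log-softmax + NLL internally

x = torch.randn(4, 32)               # made-up batch of 4 feature vectors
targets = torch.randint(0, num_classes, (4,))

logits = model(x)                    # no explicit softmax here
loss = loss_fn(logits, targets)      # expects raw logits, not probabilities
loss.backward()
```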

Moreover, at prediction time softmax can be useful as well, because when you report a probability instead of just a score, you can interpret it as confidence: e.g. the model is 98% sure of its prediction.
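For instance (with made-up scores), the softmax output lets you read off how sure the model is:

```python
import torch

logits = torch.tensor([[4.0, 0.2, -1.0]])   # made-up scores for a single example
probs = torch.softmax(logits, dim=1)

conf, pred = probs.max(dim=1)
print(pred.item(), round(conf.item(), 2))   # class 0 with roughly 0.97 "confidence"
```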

In some cases, it is not the most probable class that you are interested in. For example, in credit scoring even a low probability of default (say 20%) may be high enough to reject a loan application. In such cases, instead of the most probable class you want to look at the probabilities themselves, and softmax helps to estimate them correctly.
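A hypothetical sketch of such a decision rule (the class index and the 20% threshold are purely illustrative):

```python
import torch

logits = torch.tensor([[1.5, 0.2]])              # made-up scores; class 1 = "default"
p_default = torch.softmax(logits, dim=1)[0, 1]   # ~0.21 here

# Reject if the default probability exceeds the business threshold,
# even though "no default" is still the most probable class.
decision = "reject" if p_default > 0.20 else "approve"
print(decision)
```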

David Dale
  • OK thanks. I will try adding it and see what it does to the convergence. – abby yorker May 12 '19 at 19:12
  • OK, this is the answer I wanted. It turns out that PyTorch builds the softmax directly into cross-entropy, so I did not need to add it in. – abby yorker May 14 '19 at 21:59
  • As stated [here](https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization), you still need softmax to avoid division by 0. – Hermes Morales Mar 12 '21 at 18:39