16

In PyTorch, a classification network model is defined like this:

import torch
import torch.nn.functional as F

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)       # output layer

    def forward(self, x):
        x = F.relu(self.hidden(x))      # activation function for hidden layer
        x = self.out(x)                 # linear output layer, returns raw scores
        return x

Is softmax applied here? In my understanding, it should look like this:

class Net(torch.nn.Module):
    def __init__(self, n_feature, n_hidden, n_output):
        super(Net, self).__init__()
        self.hidden = torch.nn.Linear(n_feature, n_hidden)   # hidden layer
        self.relu = torch.nn.ReLU(inplace=True)              # activation for hidden layer
        self.out = torch.nn.Linear(n_hidden, n_output)       # output layer
        self.softmax = torch.nn.Softmax(dim=1)               # softmax over the class dimension

    def forward(self, x):
        x = self.hidden(x)
        x = self.relu(x)       # activation function for hidden layer
        x = self.out(x)
        x = self.softmax(x)    # turn raw scores into class probabilities
        return x

I understand that `F.relu` in the first block applies ReLU just like `self.relu` does in the second, but the first block of code doesn't apply softmax, right?

yujuezhao
  • On a related note, if you're using [`nn.CrossEntropyLoss`](https://pytorch.org/docs/stable/nn.html#crossentropyloss) then that applies log-softmax followed by nll-loss. You probably want to make sure you're not applying softmax twice since softmax is **not** [idempotent](https://en.wikipedia.org/wiki/Idempotence). – jodag Aug 16 '19 at 03:11
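
A quick check (not from the thread) that illustrates jodag's point: applying softmax a second time changes the values.

import torch
import torch.nn.functional as F

logits = torch.tensor([[2.0, 1.0, 0.0]])
once = F.softmax(logits, dim=1)     # probabilities over the 3 classes
twice = F.softmax(once, dim=1)      # softmax applied a second time
print(torch.allclose(once, twice))  # False: softmax is not idempotent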

1 Answer

10

Latching on to what @jodag was already saying in his comment, and extending it a bit to form a full answer:

No, PyTorch does not automatically apply softmax; you can apply torch.nn.Softmax() yourself at any point you like. However, softmax has some issues with numerical stability, which we want to avoid as much as possible. One solution is to use log-softmax, but this tends to be slower than a direct computation.
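
As a small illustrative sketch (the values are deliberately extreme and not from the original answer): taking the log of a softmax can underflow to -inf, while log-softmax computes the same quantity in a stable way.

import torch
import torch.nn.functional as F

logits = torch.tensor([[0.0, 200.0]])   # deliberately extreme raw scores

# softmax followed by log underflows: the first probability rounds to 0
# in float32, and log(0) is -inf
print(torch.log(F.softmax(logits, dim=1)))   # tensor([[-inf, 0.]])

# log-softmax computes the same quantity directly and stays finite
print(F.log_softmax(logits, dim=1))          # tensor([[-200., 0.]])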

Especially when we are using negative log likelihood as a loss function (in PyTorch, this is torch.nn.NLLLoss), we can exploit the fact that the derivative of (log-)softmax + NLL is mathematically quite nice and simple, which is why it makes sense to combine both into a single function/element. The result is torch.nn.CrossEntropyLoss. Again, note that this only applies directly to the last layer of your network; any other computation is not affected by any of this.
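
For instance, a minimal sketch (tensor shapes and values are just placeholders, not from the original answer) showing that the combined loss on raw logits matches log-softmax followed by NLL:

import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
logits = torch.randn(4, 3)            # raw outputs of the last nn.Linear layer
targets = torch.tensor([0, 2, 1, 0])  # ground-truth class indices

# Option A: log-softmax followed by negative log likelihood
loss_a = nn.NLLLoss()(F.log_softmax(logits, dim=1), targets)

# Option B: the combined nn.CrossEntropyLoss applied directly to the logits
loss_b = nn.CrossEntropyLoss()(logits, targets)

print(loss_a.item(), loss_b.item())   # equal up to floating point error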

dennlinger
  • If I understand you correctly, it would be better to apply `nn.CrossEntropyLoss` as the loss function to the output of the last `nn.Linear()` layer, instead of using `nn.Softmax()` directly. Is that correct? – yujuezhao Aug 16 '19 at 13:56
  • A follow-up question: the output of `nn.Softmax()` can be interpreted as the probability of a certain class, while the outputs of `nn.Linear()` are not guaranteed to sum to 1. Would that lose the meaning of the final output? – yujuezhao Aug 16 '19 at 14:04
  • To answer your first comment: you're not really replacing any layer with a loss function, but rather replacing your current loss function (which should be `nn.NLLLoss`) with a different loss, while removing the last `nn.Softmax()`. I think the idea you had is already correct, though. The second question: since your loss function still "applies" log-softmax (or at least your derivatives are based on that), the interpretation still holds. If you are using the output in any other way, e.g., during inference, you of course have to re-apply a softmax in that case. – dennlinger Aug 16 '19 at 15:29
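
To make that comment concrete, a minimal sketch (the variable names and sizes are mine, not from the thread): train on the raw logits with nn.CrossEntropyLoss, and only re-apply softmax when you need actual probabilities at inference time.

import torch
import torch.nn.functional as F

# `Net` is the first network from the question (no softmax in forward)
net = Net(n_feature=10, n_hidden=16, n_output=3)
criterion = torch.nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(net.parameters(), lr=0.01)

x = torch.randn(8, 10)           # dummy input batch
y = torch.randint(0, 3, (8,))    # dummy class labels

# training step: the loss works directly on the raw logits
loss = criterion(net(x), y)
optimizer.zero_grad()
loss.backward()
optimizer.step()

# inference: re-apply softmax to turn logits into probabilities
with torch.no_grad():
    probs = F.softmax(net(x), dim=1)
    print(probs.sum(dim=1))      # each row sums to 1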