I am trying to compute the derivative of the softmax activation function. I found this: https://math.stackexchange.com/questions/945871/derivative-of-softmax-loss-function but nobody seems to give the proper derivation of the answers for the i = j and i != j cases. Could someone please explain this? I am confused about taking derivatives when a summation is involved, as in the denominator of the softmax activation function.
-
I'm voting to close this question as off-topic because it has nothing to do with programming – desertnaut Mar 16 '18 at 18:06
-
Yes, it does. There is a thing called the softmax function in neural networks, and although one can use libraries, knowing the underlying math is an advantage. @desertnaut – mLstudent33 Dec 09 '21 at 08:34
-
@mLstudent33 we have no less than 3 (!) dedicated SE sites for such *non-programming* ML questions, which are off-topic here; please see the intro and NOTE in https://stackoverflow.com/tags/machine-learning/info – desertnaut Dec 09 '21 at 09:32
-
I’m voting to close this question because it is not about programming as defined in the [help] but about ML theory and/or methodology - please see the note in https://stackoverflow.com/tags/neural-network/info – desertnaut Dec 09 '21 at 09:33
-
@mLstudent33 and sincere thanks for the mini-lecture on softmax and libraries, but I think I got this https://stackoverflow.com/questions/34968722/how-to-implement-the-softmax-function-in-python/38250088#38250088 – desertnaut Dec 09 '21 at 09:40
2 Answers
The derivative of a sum is the sum of the derivatives, i.e.:
d(f1 + f2 + f3 + f4)/dx = df1/dx + df2/dx + df3/dx + df4/dx
To derive the derivatives of `p_j` with respect to `o_i`, we start with:

d_i(p_j) = d_i(exp(o_j) / Sum_k(exp(o_k)))

I decided to use `d_i` for the derivative with respect to `o_i` to make this easier to read.
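In code, the `p_j` here are just the outputs of a softmax over the inputs `o_i`. A minimal sketch of that function (the name `softmax` and the sample values are mine, not from the thread):

```python
import numpy as np

def softmax(o):
    # Subtracting the max is for numerical stability only; it does not change
    # the result, since exp(o_j - c) / Sum_k(exp(o_k - c)) == exp(o_j) / Sum_k(exp(o_k)).
    e = np.exp(o - np.max(o))
    return e / e.sum()

o = np.array([1.0, 2.0, 3.0])
p = softmax(o)
print(p)        # ≈ [0.090, 0.245, 0.665]
print(p.sum())  # 1.0
```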
Using the product rule we get:
d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))
Looking at the first term, the derivative will be 0 if `i != j`. This can be represented with a delta function, which I will call `D_ij`. This gives (for the first term):

= D_ij * exp(o_j) / Sum_k(exp(o_k))

which is just our original function multiplied by `D_ij`:

= D_ij * p_j
For the second term, when we differentiate each element of the sum individually, the only non-zero term will be the one where `i = k`. This gives us (not forgetting the power rule, because the sum is in the denominator):

= -exp(o_j) * Sum_k(d_i(exp(o_k))) / Sum_k(exp(o_k))^2
= -exp(o_j) * exp(o_i) / Sum_k(exp(o_k))^2
= -(exp(o_j) / Sum_k(exp(o_k))) * (exp(o_i) / Sum_k(exp(o_k)))
= -p_j * p_i
Putting the two together, we get the surprisingly simple formula:

d_i(p_j) = D_ij * p_j - p_j * p_i

If you really want, we can split it into the i = j and i != j cases:

i = j: D_ii * p_i - p_i * p_i = p_i - p_i * p_i = p_i * (1 - p_i)
i != j: D_ij * p_j - p_j * p_i = 0 - p_j * p_i = -p_j * p_i

which is our answer.
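The formula `D_ij * p_j - p_j * p_i` is exactly the Jacobian of the softmax, and it is easy to sanity-check numerically against finite differences. A sketch (function names are mine):

```python
import numpy as np

def softmax(o):
    e = np.exp(o - np.max(o))
    return e / e.sum()

def softmax_jacobian(o):
    # J[i, j] = d p_j / d o_i = D_ij * p_j - p_j * p_i
    # np.diag(p) supplies the D_ij * p_j part, np.outer(p, p) the p_j * p_i part.
    p = softmax(o)
    return np.diag(p) - np.outer(p, p)

# Compare against a central finite-difference approximation of d p / d o_i.
o = np.array([0.5, -1.0, 2.0])
eps = 1e-6
numeric = np.zeros((3, 3))
for i in range(3):
    d = np.zeros(3)
    d[i] = eps
    numeric[i] = (softmax(o + d) - softmax(o - d)) / (2 * eps)

assert np.allclose(softmax_jacobian(o), numeric, atol=1e-8)
```

Note that each row of the Jacobian sums to 0, as it must: the outputs always sum to 1, so any change in one input only redistributes probability mass among the outputs.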

-
thank you so much! This is so clear. I couldn't have asked for a better explanation! :) I am glad I understand the derivation completely now. I am going to refer this to the unanswered one on math.stack exchange! – Roshini Jun 13 '16 at 14:09
-
@SirGuy shouldn't your third expression be `d_i(exp(o_j)) / Sum_k(exp(o_k)) + exp(o_j) * d_i(1/Sum_k(exp(o_k)))` ? Missing exp before the last `o_k` – Benjamin Crouzier Oct 31 '17 at 10:29
-
"Looking at the first term, the derivative will be 0 if i != j" — why is this the case? The output o_i (i.e., a particular node of softmax) depends on all the values from the incoming layer. Won't this mean that if i != j, the values will be different from i = k but not 0? See https://eli.thegreenplace.net/2016/the-softmax-function-and-its-derivative – harveyslash Jul 06 '18 at 10:17
-
@harveyslash Each variable is independent, so that partial derivative is exactly 0. When taking the partial derivative with respect to some variable, you treat every other variable as constant in the process. – SirGuy Jul 08 '18 at 03:43
-
@harveyslash The derivative doesn't look at what the output is, it looks at how that output changes when you vary just one variable, this is why all other variables are treated as constant. I hope this clears this up a bit. – SirGuy Jul 08 '18 at 03:44
-
please see my question that i have posted in detail https://math.stackexchange.com/questions/2843505/derivative-of-softmax-without-cross-entropy My derivative came out to be non 0 for the case that you have 0 for. Am I mistaken ? – harveyslash Jul 08 '18 at 07:58
-
@harveyslash First, in the question you link to, you incorrectly say that you add up the elements of the Jacobian to get the 'final' derivative. Think instead of the Jacobian as being the derivative, not an intermediate step that leads to the derivative. – SirGuy Jul 09 '18 at 18:18
-
@harveyslash in my solution the `i` and `j` refer to the elements of the Jacobian matrix. You seem to think that the 'thing' that goes to 0 is the derivative, but it's just one part of the partial derivative. You wrote out each derivative manually (for 4 inputs) whereas I treated the general case. – SirGuy Jul 09 '18 at 18:28
-
@harveyslash The thing that went to 0 was the subexpression `d_i(exp(o_j))`, which is part of the subexpression `d_i(exp(o_j)) / Sum_k(exp(o_k))`. Look carefully at the parentheses and you will see that this is the derivative of `exp(o_j)` with respect to `o_i`, divided by `Sum_k(exp(o_k))`. The derivative of `Sum_k(exp(o_k))` with respect to `o_i` is taken care of in the second part of the product rule expansion. Does this help clear things up? – SirGuy Jul 09 '18 at 18:28
-
It does. I think a detailed answer to my question would be of great help to others too :) – harveyslash Jul 21 '18 at 13:54
For what it's worth, here is my derivation based on SirGuy's answer. (Feel free to point out errors if you find any.)

[Derivation shown as an image in the original answer.]

-
thanks very much for this! I have just one doubt: why does `Σ_k ( ( d e^{o_k} ) / do_i )` evaluate to `e^{o_i}` from step 4 to 5? I'd be very grateful for any insights you can offer on that question. – duhaime Dec 31 '17 at 05:10
-
@duhaime Good question. Think about all the terms of that sum one by one and see what happens to each term. You see that you have two cases: when i = k, the term is `d/do_i e^o_i`, which is `e^o_i`. When i != k, you get a bunch of zeroes. – Benjamin Crouzier Dec 31 '17 at 11:24
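The case analysis in this comment, i.e. that `d/do_i Sum_k(e^{o_k}) = e^{o_i}` because every term with k != i is constant, can be confirmed with a finite difference. A sketch (sample values are mine):

```python
import numpy as np

o = np.array([0.3, 1.7, -0.5])
eps = 1e-6
i = 1
d = np.zeros_like(o)
d[i] = eps

# Central difference of S(o) = Sum_k exp(o_k) with respect to o_i.
numeric = (np.exp(o + d).sum() - np.exp(o - d).sum()) / (2 * eps)
print(numeric, np.exp(o[i]))  # the two values agree
```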