39

Is there a numerically stable way to compute the softmax function below? I am getting values that become NaNs in my neural network code.

np.exp(x) / np.sum(np.exp(x))
desertnaut
Abhishek Bhatia
  • The answers here show the better way to calculate the softmax: http://stackoverflow.com/questions/34968722/softmax-function-python – Alex Riley Mar 04 '17 at 18:14
  • @ajcr The accepted answer at this link is actually poor advice. Abhishek, the thing the OP does, even though they first didn't seem to understand why, is the right thing to do. There are no numerically difficult steps in the softmax except overflow, so shifting all the inputs to the left, while mathematically equivalent, removes the possibility of overflow and is a numerical improvement. – Paul Panzer Mar 04 '17 at 19:00
  • Yes, although the author of that accepted answer acknowledges in the comments that subtracting the maximum does not introduce a "necessary term" but actually improves numerical stability (perhaps that answer should be edited...). In any case, the question of numerical stability is addressed in several of the other answers there. @AbhishekBhatia: do you think the link answers your question satisfactorily, or would a new answer here be beneficial? – Alex Riley Mar 04 '17 at 19:39

4 Answers

78

The softmax exp(x)/sum(exp(x)) is actually numerically well-behaved. It has only positive terms, so we needn't worry about loss of significance, and the denominator is at least as large as the numerator, so the result is guaranteed to fall between 0 and 1.

The only accident that might happen is over- or underflow in the exponentials. Overflow of a single element of x, or underflow of all of them, will render the output more or less useless.

But it is easy to guard against that by using the identity softmax(x) = softmax(x + c), which holds for any scalar c: subtracting max(x) from x leaves a vector that has only non-positive entries, ruling out overflow, and at least one element that is zero, ruling out a vanishing denominator (underflow in some but not all entries is harmless).

Footnote: theoretically, catastrophic accidents in the sum are possible, but you'd need a ridiculous number of terms. For example, even using 16-bit floats, which can only resolve 3 decimals---compared to the 15 decimals of a "normal" 64-bit float---we'd need between 2^1431 (~6 x 10^430) and 2^1432 terms to get a sum that is off by a factor of two.
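
For illustration, a minimal NumPy sketch of this guard (the example vectors are arbitrary):

import numpy as np

def softmax_shifted(x):
    # softmax(x) == softmax(x - max(x)); the shift only rules out overflow
    z = np.exp(x - np.max(x))
    return z / np.sum(z)

x = np.array([1.0, 2.0, 3.0])                    # small values: the naive formula is safe here
naive = np.exp(x) / np.sum(np.exp(x))
print(np.allclose(naive, softmax_shifted(x)))    # True

big = np.array([1000.0, 1001.0, 1002.0])         # naive np.exp(big) would overflow to inf
print(softmax_shifted(big))                      # finite result, identical to softmax of [0., 1., 2.]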

Paul Panzer
47

The softmax function is prone to two issues: overflow and underflow.

Overflow: occurs when very large numbers are approximated as infinity.

Underflow: occurs when very small numbers (near zero on the number line) are rounded to zero.
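
Both failure modes are easy to reproduce directly with np.exp (a quick illustration; the specific inputs are arbitrary):

import numpy as np

print(np.exp(np.float64(1000.0)))    # inf: the exponential overflows
print(np.exp(np.float64(-1000.0)))   # 0.0: the exponential underflows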

To combat these issues when doing softmax computation, a common trick is to shift the input vector by subtracting the maximum element in it from all elements. For the input vector x, define z such that:

z = x - max(x)

Then take the softmax of the new (stable) vector z.


Example:

import numpy as np

def stable_softmax(x):
    # shift so the largest entry becomes 0: exp() then cannot overflow
    z = x - np.max(x)
    numerator = np.exp(z)
    denominator = np.sum(numerator)
    softmax = numerator / denominator
    return softmax

# input vector
In [267]: vec = np.array([1, 2, 3, 4, 5])
In [268]: stable_softmax(vec)
Out[268]: array([ 0.01165623,  0.03168492,  0.08612854,  0.23412166,  0.63640865])

# input vector with really large number, prone to overflow issue
In [269]: vec = np.array([12345, 67890, 99999999])
In [270]: stable_softmax(vec)
Out[270]: array([ 0.,  0.,  1.])

In the above case, we safely avoided the overflow problem by using stable_softmax().
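
For contrast, a quick sketch of what the naive formula does on that same vector (the helper name naive_softmax is only for illustration): every exponential overflows to inf, and inf divided by inf is nan.

def naive_softmax(x):
    y = np.exp(x)          # exp(12345), exp(67890), exp(99999999) all overflow to inf
    return y / np.sum(y)   # inf / inf evaluates to nan

# naive_softmax(np.array([12345, 67890, 99999999])) returns array([nan, nan, nan]),
# along with overflow / invalid-value RuntimeWarnings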


For more details, see the Numerical Computation chapter of the Deep Learning book.

kmario23
  • I'm not sure if subtracting the max is the best way to handle underflow. But as the answer by Paul suggests, underflow is less of an issue. – Adam Wilson Dec 23 '21 at 01:09
9

Extending @kmario23's answer to support 1- or 2-dimensional NumPy arrays or lists. 2D tensors (assuming the first dimension is the batch dimension) are common if you're passing a batch of results through softmax:

import numpy as np


def stable_softmax(x):
    # subtract the per-row max (keepdims=True keeps shapes broadcastable),
    # so exp() sees only non-positive values and cannot overflow
    z = x - np.max(x, axis=-1, keepdims=True)
    numerator = np.exp(z)
    denominator = np.sum(numerator, axis=-1, keepdims=True)
    softmax = numerator / denominator
    return softmax


test1 = np.array([12345, 67890, 99999999])  # 1D numpy
test2 = np.array([[12345, 67890, 99999999], # 2D numpy
                  [123, 678, 88888888]])    #
test3 = [12345, 67890, 999999999]           # 1D list
test4 = [[12345, 67890, 999999999]]         # 2D list

print(stable_softmax(test1))
print(stable_softmax(test2))
print(stable_softmax(test3))
print(stable_softmax(test4))

 [0. 0. 1.]

[[0. 0. 1.]
 [0. 0. 1.]]

 [0. 0. 1.]

[[0. 0. 1.]]
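
As an optional cross-check (assuming SciPy is installed), the results can also be compared against scipy.special.softmax, which is likewise implemented in a numerically stable way:

from scipy.special import softmax   # available in SciPy >= 1.2

print(np.allclose(stable_softmax(test2), softmax(test2, axis=-1)))   # True
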
David Parks
  • This is still underflowing for me – Nihar Karve May 04 '20 at 07:14
  • I've been using this for quite some time without an issue. Are you sure you didn't have NaNs or Infs going in as input? – David Parks May 04 '20 at 14:59
  • I see now - `np.seterr(all='raise')` will complain about underflows for large values *even though* the function works correctly. This is indeed the best solution. – Nihar Karve May 04 '20 at 16:33
  • This is a golden answer and works like a charm. Please take strong heed of this solution, as I tried more than half a dozen so-called softmax solutions and this was the only one that actually worked. – Allohvk Oct 22 '22 at 08:07
-5

There is nothing wrong with calculating the softmax function as it is in your case. The problem seems to come from an exploding gradient or that sort of issue with your training method. Focus on those matters with either "clipping values" or "choosing the right initial distribution of weights".
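
As a rough sketch of the value-clipping idea (the clip_value threshold and the grads array below are made-up examples, not something this answer prescribes):

import numpy as np

def clip_gradients(grads, clip_value=1.0):
    # clamp every gradient entry to [-clip_value, clip_value] to tame exploding gradients
    return np.clip(grads, -clip_value, clip_value)

grads = np.array([0.5, -3.0, 250.0])   # made-up gradient values
print(clip_gradients(grads))           # entries clipped to 0.5, -1.0, 1.0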