
I am using a Softmax activation function in the last layer of a neural network, but I have problems with a numerically safe implementation of this function.

A naive implementation would be this one:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f));
y /= y.sum();

This does not work very well for more than about 100 hidden nodes, because y will be NaN in many cases (if y(f) > 709, exp(y(f)) returns inf). I came up with this version:

Vector y = mlp(x); // output of the neural network without softmax activation function
for(int f = 0; f < y.rows(); f++)
  y(f) = safeExp(y(f), y.rows());
y /= y.sum();

where safeExp is defined as

double safeExp(double x, int div)
{
  static const double maxX = std::log(std::numeric_limits<double>::max());
  const double max = maxX / (double) div;
  if(x > max)
    x = max;
  return std::exp(x);
}

This function limits the input of exp. It works in most cases, but not in all, and I did not really manage to find out in which cases it fails. With 800 hidden neurons in the previous layer it does not work at all.

However, even if this worked, I would somehow "distort" the result of the ANN. Can you think of any other way to calculate the correct solution? Are there any C++ libraries or tricks I can use to calculate the exact output of this ANN?

edit: The solution provided by Itamar Katz is:

Vector y = mlp(x); // output of the neural network without softmax activation function
double ymax = y(0);                // maximal component of y
for(int f = 1; f < y.rows(); f++)
  if(y(f) > ymax)
    ymax = y(f);
for(int f = 0; f < y.rows(); f++)
  y(f) = exp(y(f) - ymax);
y /= y.sum();

And it really is mathematically the same. In practice, however, some small values become exactly 0 because of the finite floating-point precision (exp(y(f) - ymax) underflows to 0 once y(f) - ymax drops below roughly -745 in double precision). I wonder why nobody ever writes these implementation details down in textbooks.

alfa
    "I wonder why nobody ever writes these implementation details down in textbooks." I've always wondered the same thing! – pjreddie Jan 28 '14 at 21:03
  • "It really is mathematically the same" - reading further, someone says your method is preferred due to numerical stability.: https://stackoverflow.com/questions/34968722/softmax-function-python – gremwell Jun 19 '17 at 05:57

2 Answers


First go to log scale, i.e. calculate log(y) instead of y. The log of the numerator is trivial. To calculate the log of the denominator, you can use the following 'trick': http://lingpipe-blog.com/2009/06/25/log-sum-of-exponentials/
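
A minimal sketch of that idea in C++, assuming the same `Vector` interface as in the question (`y(f)`, `y.rows()`); `logSumExp` is just an illustrative helper here, not part of any particular library:

// Hypothetical helper: computes log(sum_f exp(a(f))) without overflow,
// using the log-sum-exp trick from the linked post.
double logSumExp(const Vector& a)
{
  double m = a(0);                      // m = max_f a(f)
  for(int f = 1; f < a.rows(); f++)
    if(a(f) > m)
      m = a(f);
  double sum = 0.0;
  for(int f = 0; f < a.rows(); f++)
    sum += std::exp(a(f) - m);          // every exponent is <= 0, so no overflow
  return m + std::log(sum);
}

// usage (log-domain softmax):
Vector a = mlp(x);                      // pre-softmax outputs
double lse = logSumExp(a);
for(int f = 0; f < a.rows(); f++)
  a(f) -= lse;                          // a(f) is now log(softmax(f)); exp() it only if probabilities are needed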

Itamar Katz
  • A perfect solution. I will add the code in a minute. Could you confirm that please? Thank you very much. – alfa Mar 28 '12 at 13:40
  • It doesn't seem correct; follow the algebra of what `log(y(f))` is: **log(y(f)) = log(exp(y(f))) - log(sum(exp(y(f))))** and plug in the mentioned 'trick' result for the log of the sum. – Itamar Katz Mar 28 '12 at 14:53
  • ln(y_f) = ln(exp(a_f)) - ln(sum over f' exp(a_f')) = a_f - ln[sum over f' exp(m)/exp(m) * exp(a_f')] = a_f - m - ln[sum over f' exp(-m) * exp(a_f')] = a_f - m - ln[sum over f' exp(a_f' - m)] <=> y_f = exp(a_f - m)/(sum over f' exp(a_f' - m)). Here a_f is y(f) before exp() in the listed code above. Where is the error? :D – alfa Mar 28 '12 at 18:03
  • And I did a test with a_1 = 1, a_2 = 2, a_3 = 3. The vector y is in both cases y = (0.090031,0.24473,0.66524)^T. At least in this case it seems to be correct. – alfa Mar 28 '12 at 18:25

I know it's already answered, but I'll post a step-by-step derivation here anyway.

Take the log:

zj = wj . x + bj
oj = exp(zj)/sum_i{ exp(zi) }
log oj = zj - log sum_i{ exp(zi) }

Let m be max_i { zi } and use the log-sum-exp trick:

log oj = zj - log {sum_i { exp(zi + m - m)}}
   = zj - log {sum_i { exp(m) exp(zi - m) }}
   = zj - log {exp(m) sum_i {exp(zi - m)}}
   = zj - m - log {sum_i { exp(zi - m)}}

The term exp(zi - m) can suffer underflow if m is much greater than some other z_i, but that's OK, since it means that z_i is irrelevant to the softmax output after normalization. The final result is:

oj = exp (zj - m - log{sum_i{exp(zi-m)}})
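
A rough C++ sketch of this derivation, again assuming the question's `Vector` interface; `stableSoftmax` is just an illustrative name, not an existing API:

// Computes o(j) = exp(z(j) - m - log(sum_i exp(z(i) - m))) as derived above.
Vector stableSoftmax(Vector z)
{
  double m = z(0);                       // m = max_i z(i)
  for(int i = 1; i < z.rows(); i++)
    if(z(i) > m)
      m = z(i);
  double sum = 0.0;
  for(int i = 0; i < z.rows(); i++)
    sum += std::exp(z(i) - m);           // may underflow to 0 when z(i) << m, which is acceptable
  const double logSum = std::log(sum);
  for(int i = 0; i < z.rows(); i++)
    z(i) = std::exp(z(i) - m - logSum);  // final softmax output o(i)
  return z;
}
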
Khanis Rok
  • Thanks! Your answer helps! You mentioned "that's OK, since it means that z_i is irrelevant to the softmax output after normalization"; do you mean that if underflow of `exp(zi-m)` happens, it does not add much error to the result? – Helin Wang Feb 01 '17 at 17:25
  • Sorry for the late reply. Yes, if m >> zi, then exp(zi - m) would be near 0; the underflow just changes it to 0, which doesn't change the final result much. – Khanis Rok Jun 26 '17 at 20:13