
I implemented the softmax function and later discovered that it has to be shifted in order to be numerically stable (duh). And now it is again not stable, because even after subtracting max(x) from my vector, the values are still too large to be used as exponents of e. Here is a picture of the code I used to pinpoint the bug; vector here is a sample output vector from forward propagation:


[screenshot of the softmax code and the resulting output values]

We can clearly see that the values are too big; instead of probabilities I get these really small numbers, which leads to a small error, which leads to vanishing gradients, and finally makes the network unable to learn.
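For reference, here is a minimal sketch of the max-shifted softmax I am describing (the exact code is only visible in the screenshot above, so the names here are placeholders), assuming NumPy:

```python
import numpy as np

def softmax(x):
    # Shift by the maximum so the largest exponent is exp(0) = 1,
    # which prevents overflow for large logits.
    shifted = x - np.max(x)
    exps = np.exp(shifted)
    return exps / np.sum(exps)
```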

rLoper

1 Answer


You are completely right: naively translating the mathematical definition of softmax can make it numerically unstable, which is why you have to subtract the maximum of x before doing any computation.
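For example, a quick check (assuming NumPy) shows why the shift matters: with large logits the naive version overflows, while the shifted one stays finite:

```python
import numpy as np

x = np.array([1000.0, 1001.0, 1002.0])

# Naive softmax: exp(1000) overflows to inf, so the result is nan.
naive = np.exp(x) / np.sum(np.exp(x))

# Shifted softmax: exponents are exp(-2), exp(-1), exp(0), all finite.
exps = np.exp(x - np.max(x))
stable = exps / np.sum(exps)

print(naive)   # [nan nan nan] (with overflow warnings)
print(stable)  # [0.09003057 0.24472847 0.66524096]
```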

Your implementation is correct, and vanishing/exploding gradients are an independent problem that you might encounter depending on what kind of neural network you intend to use.

Alexis816
  • Dang, I don't know what to do then. I have checked my formulas and code so many times that I think it must be something practical. Do you maybe know what it could be (from experience)? I am using ReLU, btw. – rLoper Apr 25 '20 at 13:32
  • I don't see any problem with your code; your implementation is correct. You might want to read this SO answer: https://stackoverflow.com/questions/17187507/why-use-softmax-as-opposed-to-standard-normalization. You are not using ReLU anywhere in your code; ReLU is a function that maps `x` to `max(0, x)`. – Alexis816 Apr 25 '20 at 14:07