
I'm a little confused about why the activation derivative in backpropagation is what it is.

Firstly, when I remove the activation derivative from the backpropagation algorithm and replace it with a constant, the network still trains, although slightly slower. So I assume it's not essential to the algorithm, but it does provide a performance advantage.
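Roughly, what I mean is something like this toy sketch (a single sigmoid output neuron, not my real code):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

z = 0.8                        # pre-activation input to the output neuron
output, target = sigmoid(z), 1.0

# Standard backprop: the error signal is scaled by the activation derivative.
delta_standard = (output - target) * sigmoid_prime(z)

# My modification: the activation derivative is replaced by a constant.
delta_constant = (output - target) * 0.25

print(delta_standard, delta_constant)
```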

However, if the activation derivative is (put simply), just the rate of change of the activation function with respect to the current input, then why does this offer a performance improvement?

Surely at values where the activation function is changing fastest we would want a smaller value, so that the weight update is smaller? This would prevent weight changes near steep slopes of the activation function from producing large changes in the output. However, this is the complete opposite of how the algorithm actually works.

Could someone explain to me why it's set up like it is and why that provides such a performance improvement?

user11406

1 Answer


I'm not entirely sure if this is what you're asking for, but this answer may offer some insight into what you are trying to understand.

So imagine the error curve:

[figure: the error curve]

We are trying to use gradient descent to minimize the cost function, correct? Let's assume that we are far out on the curve, where the error is very high. By calculating the gradient at that point, the algorithm sees that the slope is steep and therefore the error is high, so it takes a large step. As it traverses down the curve, the slope gradually approaches 0, so it takes smaller steps each time.

Visualization of gradient descent with the activation derivative:

[figure: gradient descent taking progressively smaller steps down the curve]

See how it starts by taking a big step and then takes smaller steps each time? This is achieved by the use of the activation derivative. It starts out with a big step because the curve is steep there. As the slope gets smaller, the step gets smaller too.
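If it helps, here is the same idea as a tiny script (just a one-dimensional toy error curve E(w) = w², not an actual network):

```python
# Toy error curve E(w) = w**2, with its minimum at w = 0.
# The step is proportional to the slope, so it shrinks on its own
# as w approaches the minimum.
w, learning_rate = 4.0, 0.1
for i in range(10):
    gradient = 2 * w                 # dE/dw
    step = learning_rate * gradient
    w -= step
    print(f"iter {i}: step = {step:.4f}, w = {w:.4f}")
```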

If you used a constant value, you would have to pick a very small step in order to avoid overshooting the minimum, and therefore would have to use many more iterations in order to achieve a similar result.
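Continuing the toy example above, you can count iterations for a slope-scaled step versus a small constant step (the step sizes here are arbitrary illustrative choices):

```python
# Same toy curve E(w) = w**2: count iterations until |w| drops below a tolerance.
def iterations_to_converge(step_fn, w=4.0, tol=0.01, limit=100_000):
    n = 0
    while abs(w) > tol and n < limit:
        w -= step_fn(w)
        n += 1
    return n

scaled = iterations_to_converge(lambda w: 0.1 * 2 * w)                  # step proportional to the slope
fixed = iterations_to_converge(lambda w: 0.01 * (1 if w > 0 else -1))   # small constant step
print(scaled, fixed)   # the constant-step run needs many more iterations
```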

Andrew Hu
  • What if the error is extremely low, for example a target of 0.5 and an actual output of 0.49? Assuming the sigmoid activation function is used, the activation derivative would be 0.2499, almost its maximum value. This doesn't make sense based on your graph, where lower error equals a lower derivative value. – user11406 Feb 04 '16 at 06:12
  • If you look at graphs of the derivative of the sigmoid activation function, you will see that the equation is f'(x) = f(x) * (1 - f(x)) and the graph is a bell-shaped curve peaking at 0.25. Look [here](https://theclevermachine.wordpress.com/2014/09/08/derivation-derivatives-for-common-neural-network-activation-functions/) at the graph of the derivative and you will see that its value falls off as the input moves away from 0. So for the example you presented, an error of 0.01 would yield a small change in the update function (a quick numeric check is sketched below these comments). – Andrew Hu Feb 04 '16 at 06:32
  • If you are still confused, take a look at [this](http://stackoverflow.com/questions/9785754/what-is-a-derivative-of-the-activation-function-used-for-in-backpropagation) question. It may answer some remaining questions you have. – Andrew Hu Feb 04 '16 at 06:36
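As a quick numeric check of the figures discussed in the comments above (assuming the derivative is taken with respect to the neuron's pre-activation input):

```python
# Sigmoid derivative expressed in terms of the sigmoid's output:
# f'(x) = f(x) * (1 - f(x)).
def sigmoid_prime_from_output(output):
    return output * (1.0 - output)

print(sigmoid_prime_from_output(0.49))   # 0.2499, close to the maximum of 0.25 at output 0.5
print(sigmoid_prime_from_output(0.99))   # 0.0099, small when the neuron saturates
```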