I'm a little confused about why the activation derivative in backpropagation is used the way it is.
Firstly, when I remove the activation derivative from the backpropagation algorithm and replace it with a constant, the network still trains, although somewhat more slowly. So I assume it's not essential to the algorithm, but it does provide a performance advantage.
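To make it concrete, here is a minimal sketch of the kind of experiment I mean (not my actual code): a single sigmoid hidden layer trained on XOR with plain gradient descent, where a `use_derivative` flag swaps the sigmoid derivative in the backward pass for a constant 1.0.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(use_derivative=True, epochs=5000, lr=0.5):
    # XOR inputs and targets
    X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
    y = np.array([[0], [1], [1], [0]], dtype=float)

    # Seed inside train so both runs start from the same weights
    rng = np.random.default_rng(0)
    W1 = rng.normal(size=(2, 4))   # input  -> hidden
    W2 = rng.normal(size=(4, 1))   # hidden -> output

    for _ in range(epochs):
        # Forward pass
        h = sigmoid(X @ W1)
        out = sigmoid(h @ W2)

        # Backward pass: delta = error * activation derivative,
        # or error * 1.0 when the derivative is replaced by a constant
        d_out = out * (1 - out) if use_derivative else 1.0
        delta_out = (out - y) * d_out

        d_h = h * (1 - h) if use_derivative else 1.0
        delta_h = (delta_out @ W2.T) * d_h

        # Weight updates
        W2 -= lr * h.T @ delta_out
        W1 -= lr * X.T @ delta_h

    # Final mean squared error
    return np.mean((sigmoid(sigmoid(X @ W1) @ W2) - y) ** 2)

print("with activation derivative:", train(use_derivative=True))
print("with constant instead:     ", train(use_derivative=False))
```

(The generator is seeded inside `train` purely so the two runs are compared from identical initial weights.)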
However, if the activation derivative is (put simply) just the rate of change of the activation function with respect to its current input, then why does it offer a performance improvement?
Surely at inputs where the activation function is changing fastest we would want a smaller factor, so the weight update is smaller? That would prevent small weight changes from causing large output changes near steep parts of the activation function. However, this is the complete opposite of how the algorithm actually works.
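To spell out what I mean, the standard output-layer update looks like this (using the usual notation, where $a^{(L)} = f(z^{(L)})$ is the output activation, $y$ the target, and $\eta$ the learning rate):

$$\delta^{(L)} = \left(a^{(L)} - y\right) \odot f'\!\left(z^{(L)}\right), \qquad \Delta W^{(L)} = -\eta\, \delta^{(L)} \left(a^{(L-1)}\right)^{\top}$$

so the update is scaled *up* by $f'(z)$ exactly where the activation is changing fastest, which seems to be the opposite of what my intuition above says it should be.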
Could someone explain to me why it's set up like it is and why that provides such a performance improvement?