I am learning the backpropagation algorithm used to train neural networks. It kind of makes sense, but there is still one part I don't get.
As far as I understand, the derivative of the error is calculated with respect to each weight in the network. This gives an error gradient with as many dimensions as there are weights in the net. The weights are then updated by the negative of this gradient, multiplied by the learning rate.
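Written out (with $\mathbf{w}$ for the weight vector, $E$ for the error, and $\eta$ for the learning rate, which are just the symbols I am using here), the update I have in mind is:

$$\Delta \mathbf{w} = -\eta \, \nabla_{\mathbf{w}} E(\mathbf{w}), \qquad \mathbf{w} \leftarrow \mathbf{w} + \Delta \mathbf{w}$$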
This seems about right, but why is the gradient not normalized to unit length first? What is the rationale for making the length of the weight-update (delta) vector proportional to the length of the gradient, rather than only using its direction?
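To make my question concrete, here is a small NumPy sketch of the two update rules I am comparing (the gradient values and variable names are placeholders I made up for illustration):

```python
import numpy as np

# Toy gradient, just for illustration; in practice it would come from backprop.
grad = np.array([0.03, -1.2, 0.4])
lr = 0.01  # learning rate

# Standard update: the step length scales with the norm of the gradient.
delta_standard = -lr * grad

# The alternative I have in mind: normalize the gradient first, so every
# step has length lr regardless of how large or small the gradient is.
delta_normalized = -lr * grad / np.linalg.norm(grad)

print("standard:  ", delta_standard, " length:", np.linalg.norm(delta_standard))
print("normalized:", delta_normalized, " length:", np.linalg.norm(delta_normalized))
```

Why is the first rule the standard one and not the second?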