What are the benefits of normalization of the inputs for neural networks?
I have noticed that it decreases the gradient, but I am not sure if it really leads to good results.
It is explained in this answer:
If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
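For example, standardizing the inputs usually just means giving each input column zero mean and unit variance. A minimal MATLAB/Octave-style sketch (the data matrix X here is made up for illustration, and column-wise broadcasting as in Octave or recent MATLAB is assumed):

% Hypothetical design matrix: rows are samples, columns are input features
X = [10 200000; 12 950000; 9 1500000; 11 400000];

mu    = mean(X);            % column means
sigma = std(X);             % column standard deviations
Xs    = (X - mu) ./ sigma;  % each column now has zero mean and unit variance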
Feature scaling puts all features on a comparable scale, so they contribute more evenly to the gradient descent updates, which makes optimization faster.
If you imagine a machine learning problem with two variables, one on the scale of 10 and the other on the scale of 1,000,000, gradient descent will think nearly all the error is in the second feature, even if the relative errors of both features are similar.
You could imagine the error surface for the above case as being a long, skinny ravine, and it is difficult to find the exact bottom of such a ravine if we treat both orthogonal directions with equal importance.
Feature scaling forces the ravine to become a nice, circular "bowl", and it is much easier to converge to the exact bottom since the optimization algorithm isn't distracted by any huge overwhelming features.
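One way to see this (a rough sketch, with made-up numbers) is to look at the condition number of X'*X, which governs the curvature of the squared-error surface; a huge condition number is exactly the long, skinny ravine described above:

% Two features on wildly different scales
X = [randn(100,1)*10, randn(100,1)*1e6];
cond(X'*X)                      % enormous: a long, skinny ravine

% After standardizing each column, the surface is much rounder
Xs = (X - mean(X)) ./ std(X);
cond(Xs'*Xs)                    % small, near 1: a nearly circular bowl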
Also keep in mind that feature scaling does not change the relative location of the optimum in feature space. Take linear regression as an example: if a feature is scaled by a constant c, that feature's weight is scaled by 1/c, giving you the same answer in the end. The ordinary least squares solution is
w = inv(X'*X)*X'*y
Now try replacing the features X with a rescaled version Q*C, where C is a diagonal column-scaling matrix (so Q = X*inv(C)):
w = inv(C'*Q'*Q*C)*C'*Q'*y
w = inv(C)*inv(Q'*Q)*inv(C')*C'*Q'*y
Cw = inv(Q'*Q)*Q'*y
The inv(C')*C' factor is just the identity, and multiplying both sides by C gives the last line. So using the new scaled features Q = X*inv(C) gives new weights u = Cw, and the fitted values X*w = Q*u are exactly the same as before.
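If you want to verify this numerically, here is a small sketch with random data and an arbitrary diagonal scaling C:

X = randn(50, 2);                 % random features
y = randn(50, 1);                 % random targets
C = diag([10, 0.001]);            % arbitrary diagonal column scaling

w = inv(X'*X)*X'*y;               % weights for the original features X
Q = X*inv(C);                     % rescaled features, so X = Q*C
u = inv(Q'*Q)*Q'*y;               % weights for the rescaled features

norm(u - C*w)                     % ~0: the new weights are u = C*w
norm(Q*u - X*w)                   % ~0: the fitted values are unchanged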