What are the benefits of normalization of the inputs for neural networks?
I have noticed that it decreases the gradient, but I am not sure if it really leads to good results.
It is explained in this answer:
If the input variables are combined linearly, as in an MLP, then it is rarely strictly necessary to standardize the inputs, at least in theory. The reason is that any rescaling of an input vector can be effectively undone by changing the corresponding weights and biases, leaving you with the exact same outputs as you had before. However, there are a variety of practical reasons why standardizing the inputs can make training faster and reduce the chances of getting stuck in local optima. Also, weight decay and Bayesian estimation can be done more conveniently with standardized inputs.
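For example, standardizing the inputs usually just means giving each input column zero mean and unit variance. A minimal MATLAB/Octave-style sketch (the data matrix X here is made up for illustration, and column-wise broadcasting as in Octave or recent MATLAB is assumed):

% Hypothetical design matrix: rows are samples, columns are input features
X = [10 200000; 12 950000; 9 1500000; 11 400000];

mu    = mean(X);            % column means
sigma = std(X);             % column standard deviations
Xs    = (X - mu) ./ sigma;  % each column now has zero mean and unit variance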
Feature scaling puts all features on a comparable scale, so they contribute more evenly to the gradient descent updates, which makes optimization faster.
If you imagine a machine learning problem with two variables, one on the scale of 10 and the other on the scale of 1,000,000, gradient descent will think nearly all the error is in the second feature, even if the relative errors of both features are similar.
You could imagine the error surface for the above case as being a long, skinny ravine, and it is difficult to find the exact bottom of such a ravine if we treat both orthogonal directions with equal importance.
Feature scaling forces the ravine to become a nice, circular "bowl", and it is much easier to converge to the exact bottom since the optimization algorithm isn't distracted by any huge overwhelming features.
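One way to see this (a rough sketch, with made-up numbers) is to look at the condition number of X'*X, which governs the curvature of the squared-error surface; a huge condition number is exactly the long, skinny ravine described above:

% Two features on wildly different scales
X = [randn(100,1)*10, randn(100,1)*1e6];
cond(X'*X)                      % enormous: a long, skinny ravine

% After standardizing each column, the surface is much rounder
Xs = (X - mean(X)) ./ std(X);
cond(Xs'*Xs)                    % small, near 1: a nearly circular bowl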
Also keep in mind that feature scaling does not change the relative location of the optimum in feature space. Take linear regression as an example: if a feature is scaled by a constant c, that feature's weight is scaled by 1/c, giving you the same answer in the end. The ordinary least squares solution is
w = inv(X'*X)*X'*y
Now try replacing the features X with a rescaled version Q*C, where C is a diagonal column-scaling matrix (so Q = X*inv(C)):
w = inv(C'*Q'*Q*C)*C'*Q'*y
w = inv(C)*inv(Q'*Q)*inv(C')*C'*Q'*y
Cw = inv(Q'*Q)*Q'*y
The inv(C')*C' factor is just the identity, and multiplying both sides by C gives the last line. So using the new scaled features Q = X*inv(C) gives new weights u = Cw, and the fitted values X*w = Q*u are exactly the same as before.
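If you want to verify this numerically, here is a small sketch with random data and an arbitrary diagonal scaling C:

X = randn(50, 2);                 % random features
y = randn(50, 1);                 % random targets
C = diag([10, 0.001]);            % arbitrary diagonal column scaling

w = inv(X'*X)*X'*y;               % weights for the original features X
Q = X*inv(C);                     % rescaled features, so X = Q*C
u = inv(Q'*Q)*Q'*y;               % weights for the rescaled features

norm(u - C*w)                     % ~0: the new weights are u = C*w
norm(Q*u - X*w)                   % ~0: the fitted values are unchanged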