
The question of why the weights of a neural network cannot be initialized as 0's has been asked plenty of times. The answer is straightforward: zero initial weights would result in all nodes in a layer learning the same thing, hence the symmetry has to be broken.
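To make the symmetry argument concrete, here is a minimal numpy sketch (a made-up 2-layer sigmoid network on toy data, purely for illustration): with all-zero weights every hidden unit receives an identical update at every step, so the units can never become different; a tiny random initialization is enough to break the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(8, 3))           # 8 toy samples, 3 features
y = rng.integers(0, 2, size=(8, 1))   # binary targets

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def step(W1, W2, lr=0.5):
    """One full-batch gradient step on binary cross-entropy."""
    h = sigmoid(X @ W1)                      # hidden layer, shape (8, 4)
    p = sigmoid(h @ W2)                      # output, shape (8, 1)
    d_out = p - y                            # grad of cross-entropy w.r.t. pre-sigmoid output
    dW2 = h.T @ d_out
    d_hid = (d_out @ W2.T) * h * (1 - h)
    dW1 = X.T @ d_hid
    return W1 - lr * dW1, W2 - lr * dW2

# Zero initialization: the columns of W1 (one per hidden unit) stay identical
# to each other no matter how long we train; the symmetry is never broken.
W1, W2 = np.zeros((3, 4)), np.zeros((4, 1))
for _ in range(100):
    W1, W2 = step(W1, W2)
print(np.allclose(W1, W1[:, [0]]))    # True: all hidden units are still clones

# Tiny random initialization: the columns drift apart and learn different things.
W1, W2 = 0.01 * rng.normal(size=(3, 4)), 0.01 * rng.normal(size=(4, 1))
for _ in range(100):
    W1, W2 = step(W1, W2)
print(np.allclose(W1, W1[:, [0]]))    # False: symmetry broken
```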

However, what I fail to comprehend is why initializing the weights to some random numbers close to zero would work in the first place. Even more advanced initialization techniques such as Xavier only modify the variance of the draw, so the weights still remain close to zero. Some answers in the linked question point to the existence of multiple local optima, but I seriously doubt the validity of this argument, for the following reasons:
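For reference, a minimal sketch of the Glorot/Xavier rule I am referring to (the "normal" variant; the layer sizes are made up). It only chooses the standard deviation from the fan-in and fan-out, so the actual draws are still tiny numbers centred on zero:

```python
import numpy as np

def glorot_normal(fan_in, fan_out, rng):
    # Variance chosen so that activation/gradient variance is roughly preserved
    # across layers: Var(w) = 2 / (fan_in + fan_out).
    std = np.sqrt(2.0 / (fan_in + fan_out))
    return rng.normal(0.0, std, size=(fan_in, fan_out))

rng = np.random.default_rng(0)
W = glorot_normal(256, 128, rng)
print(W.mean(), W.std())   # mean ~ 0, std ~ sqrt(2/384) ~ 0.072, still "close to zero"
```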

The (usual) cost function of an individual logistic regression has a unique minimum. Nonetheless, this insight may not generalize to more than one node, so let's forget it for now.
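For concreteness, the cost I have in mind is the usual negative log-likelihood (cross-entropy) of a single logistic unit, which is convex in $w$:

```latex
J(w) = -\sum_{i=1}^{N} \Big[ y_i \log \sigma(w^\top x_i)
        + (1 - y_i) \log\big(1 - \sigma(w^\top x_i)\big) \Big],
\qquad \sigma(z) = \frac{1}{1 + e^{-z}}
```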

Assume for the sake of argument that multiple local optima exist. Then shouldn't the proper randomization technique be to sample, Monte-Carlo-ishly, over the entire domain of possible weights, rather than drawing some random epsilons about zero? What's stopping the weights from converging again after a couple of iterations? The only rationale I can think of is that there exists a global maximum at the origin and all local optima are nicely spread 'radially', so that a tiny perturbation in any direction is sufficient to move you down the gradient towards a different local optimum, which is highly improbable.
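As a sanity check on the "entire domain" idea, here is a toy numpy sketch (layer sizes and scales are made up): with weights drawn from a very wide range, the sigmoid units saturate and the derivative factor that every backpropagated gradient passes through is essentially zero, so training would barely move; the tiny-epsilon initialization keeps the units in their responsive range.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 50))          # toy inputs: 100 samples, 50 features

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

for scale, label in [(0.05, "small init       "), (20.0, "whole-domain init")]:
    W = rng.uniform(-scale, scale, size=(50, 20))   # one hidden layer, 20 units
    h = sigmoid(X @ W)
    # h * (1 - h) is the factor every backpropagated gradient is multiplied by.
    print(label, "mean sigmoid derivative:", (h * (1 - h)).mean())
# small init        -> ~0.25  (near the maximum; gradients flow)
# whole-domain init -> ~0.005 (most units saturated; gradients all but vanish)
```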

PS1: I am asking the question here on the main Stack Overflow site because my reference is here.

PS2: The answer to why the variance of the initial weights is scaled this way can be found here. However, it does not address my question of why random initialization would work at all despite the possibility of the weights converging back together, or rather, of why the weights 'diverge' to 'learn' different features.

  • One reason is probably that SGD is not scale-invariant, which makes very different weights (resulting from your init) hard to recover from. – sascha Nov 03 '17 at 20:42
  • One of the reasons the weights should be initialized close to zero is that if you have a deep network and your weights are greater than 1, they get multiplied and propagated, and you can end up with really large values in your sigmoid and softmax functions, where you have exp(x) terms. – asakryukin Nov 04 '17 at 06:06
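A quick numpy sketch of the effect described in the comment above (depth, width and weight scales are made up): once the effective per-layer gain exceeds 1, activations grow multiplicatively with depth, and a naive softmax over such logits overflows.

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=(1, 64))

def forward(x, depth, std):
    """Push one input through `depth` random ReLU layers of width 64."""
    h = x
    for _ in range(depth):
        W = rng.normal(0.0, std, size=(64, 64))
        h = np.maximum(h @ W, 0.0)
    return h

small = forward(x, depth=20, std=np.sqrt(2.0 / 64))  # He-style scale, gain ~ 1 per layer
big   = forward(x, depth=20, std=0.5)                # too large, gain ~ 2.8 per layer
print(np.abs(small).max())   # stays of order 1 (roughly)
print(np.abs(big).max())     # roughly 1e9 after 20 layers

logits = big[0, :10]
print(np.exp(logits) / np.exp(logits).sum())  # naive softmax: overflow -> inf/nan
                                              # (numpy warns about the overflow)
```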

1 Answer


You've hit the main reason: the kernels (nodes) need to start out different so that they differentiate in what they learn.

First of all, random initialization doesn't always work; depending on how carefully you've tuned your model structure and hyper-parameters, the model sometimes fails to converge, and this is obvious from the loss in the early iterations.

For some applications, there are local minima. However, in practical use, the happy outgrowth of problem complexity is that those minima have very similar accuracy. In short, it doesn't matter which solution we find, so long as we find one. For instance, in image classification (e.g. the ImageNet contest), there are many features useful in identifying photos. As with (simpler) PCA, when we have sets of features that correlate highly with the desired output and with each other, it doesn't matter which set we use. Those features are cognate to the kernels of a CNN.
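As a rough illustration of the "many equally good minima" point, here is a hedged sketch with scikit-learn's MLPClassifier (the dataset and layer size are arbitrary choices for the demo): the same small network trained from several different random initializations typically ends up with noticeably different weights but nearly identical test accuracy.

```python
import numpy as np
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

X, y = make_moons(n_samples=1000, noise=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

weights, scores = [], []
for seed in range(5):
    clf = MLPClassifier(hidden_layer_sizes=(16,), max_iter=2000, random_state=seed)
    clf.fit(X_tr, y_tr)
    scores.append(clf.score(X_te, y_te))
    weights.append(np.concatenate([w.ravel() for w in clf.coefs_]))

print("test accuracies:", np.round(scores, 3))   # typically all within a percent or two
print("distance of each run's weights from run 0:",
      np.round([np.linalg.norm(weights[0] - w) for w in weights[1:]], 1))
# The solutions differ substantially as weight vectors, yet they are
# essentially interchangeable in accuracy.
```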
