The question of why the weights of a neural network cannot be initialized to zeros has been asked plenty of times. The answer is straightforward: zero initial weights would result in all nodes in a layer learning the same thing, so the symmetry has to be broken.
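To make the symmetry argument concrete, here is a minimal sketch (my own toy example, not from any of the linked answers): a one-hidden-layer network whose weights all start at the same constant value. Every hidden unit then computes the same activation and receives the same gradient, so the units remain exact copies of each other no matter how long we train.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))      # 5 samples, 3 features
y = rng.normal(size=(5, 1))      # regression targets

W1 = np.full((3, 4), 0.5)        # input -> 4 hidden units, identical weights
W2 = np.full((4, 1), 0.5)        # hidden -> output, identical weights

for _ in range(50):
    h = np.tanh(X @ W1)                                   # identical columns
    err = h @ W2 - y
    dW2 = h.T @ err / len(X)
    dW1 = X.T @ (err @ W2.T * (1 - h**2)) / len(X)        # identical columns too
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

# All four hidden units still have exactly the same incoming weights.
print(np.allclose(W1, W1[:, :1]))    # True
```

The same argument goes through for any constant initialization, zero included; with all-zero weights the situation is even more degenerate, since every gradient is exactly zero and nothing moves at all.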
However, what I fail to comprehend is why initializing the weights to some random numbers close to zero would work. Even more advanced initialization techniques, such as Xavier, modify only the variance, which remains close to zero. Some answers in the linked question point to the existence of multiple local optima, but I seriously doubt the validity of this argument, for the following reasons:
The (usual) cost function of an individual logistic regression has a unique minimum. Nonetheless, this insight may not be generalizable to more than one node, so let's forget it for now.
Assume for the sake of argument that multiple local optima exist. Then shouldn't the proper randomization technique be Monte-Carlo-ish sampling over the entire domain of possible weights, rather than some random epsilons about zero? What's stopping the weights from converging again after a couple of iterations? The only rationale I can think of is that there exists a global maximum at the origin and all local optima are nicely spread 'radially', so that a tiny perturbation in any direction is sufficient to move you down the gradient towards a different local optimum, which is highly improbable.
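For what it's worth, I ran a quick experiment with the "random epsilons about zero" scheme (again my own toy sketch, with an arbitrary scale of 0.01): the hidden units start nearly identical, yet training amplifies the tiny initial differences instead of collapsing them back together.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(5, 3))
y = rng.normal(size=(5, 1))

# Small random weights: epsilon-scale Gaussians about zero.
W1 = 0.01 * rng.normal(size=(3, 4))
W2 = 0.01 * rng.normal(size=(4, 1))

# Spread of the hidden units' incoming weights around their mean, before training.
spread_before = np.std(W1, axis=1).mean()

for _ in range(2000):
    h = np.tanh(X @ W1)
    err = h @ W2 - y
    dW2 = h.T @ err / len(X)
    dW1 = X.T @ (err @ W2.T * (1 - h**2)) / len(X)
    W1 -= 0.1 * dW1
    W2 -= 0.1 * dW2

spread_after = np.std(W1, axis=1).mean()
print(spread_before, spread_after)   # the columns drift apart, not together
```

So empirically the weights do 'diverge' rather than re-converge, but this experiment does not tell me why that should be expected in general, which is exactly what I am asking.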
PS1: I am asking the question here on the main Stack Overflow site because my reference is here.
PS2: The answer to why the variance of the initial weights is scaled this way can be found here. However, it does not address my question of why random initialization would work at all given the possibility of the weights converging, or rather, why the weights would 'diverge' to 'learn' different features.