Why does deep learning not suffer from float or numerical precision errors if most of its training is on data with mean 0 and std 1?

Question

Inspired by the question:

Why do different methods for solving Xc=y in python give different solution when they should not?

That question seems to have numerical issues due to floating point arithmetic, inverting matrices and restricting values to [-1,1]. What I am curious about now is why deep learning does not suffer from float or numerical precision errors if most of its training is on data with mean 0 and std 1. (I am assuming that most of the data has been pre-processed to be in that range, which I feel has to be roughly right considering how widely batch normalization is used.) Is it because deep learning does not train by raising a polynomial to a very high degree? Is there something special about SGD, or are the popular activation functions (relu, elu, etc.) simply not numerically unstable compared to a high-degree polynomial? Or does GPU training avoid floating point representation altogether? In short, why is deep learning training usually numerically stable?

score 2 · Answer 1 · answered Oct 21 '17 at 23:21

There is nothing really magical about DL as such: it suffers from numerical errors too, all the time. However, due to the scale and the number of nonlinearities, numerical instabilities in DL usually lead to infinities or NaNs rather than to wrong answers, so they are usually easy to detect. In particular, there is nothing hard about the [0,1] interval. In fact it is a great storage spot for floats, since roughly a quarter of all representable floats live in [0,1]! The problem you are referring to lies in raising such a number to a huge power, which pushes the result dangerously close to machine precision. None of the standard DL techniques takes the 30th power of any activation. In fact, most of the most successful DL techniques (based on sigmoids, tanhs and relus) are almost linear, so the numerical instabilities come mostly from the exp operations in probability estimates.
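As a rough check of the two claims above, here is a minimal sketch using only the Python standard library. It assumes IEEE 754 binary32 for the float count and Python's native float64 for the power example; the helper name `float32_bits` and the choice of x = 0.1 are mine, purely for illustration.

```python
import struct
import sys


def float32_bits(x: float) -> int:
    """Return the IEEE 754 binary32 bit pattern of x as an unsigned integer."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


# Non-negative float32 values are ordered the same way as their bit patterns,
# so the count of values in [0, 1] (counting +0.0 only) is bits(1.0) - bits(0.0) + 1.
in_unit_interval = float32_bits(1.0) - float32_bits(0.0) + 1

# Finite float32 values: all 2**32 patterns minus those with an all-ones
# exponent (infinities and NaNs), i.e. 2 signs * 2**23 mantissa patterns.
finite_total = 2**32 - 2 * 2**23

fraction_in_unit_interval = in_unit_interval / finite_total
print(f"fraction of finite float32 values in [0, 1]: {fraction_in_unit_interval:.4f}")

# A high power of a small number drops below machine epsilon, so adding it
# to an O(1) quantity leaves that quantity unchanged in float64.
x = 0.1  # illustrative value, not taken from the question
high_power = x ** 30
print("x**30 below machine epsilon:", high_power < sys.float_info.epsilon)
print("1.0 + x**30 == 1.0:", 1.0 + high_power == 1.0)
```

The value x**30 itself is stored without trouble. What gets lost is its contribution once it is combined with terms of order 1, which is what happens with a high-degree polynomial fit.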

So:

Is it about the high-degree polynomial? Yes, this is the main issue, and it is not encountered in DL.
Is there something special about SGD? Not really.
Is it about the activation functions? Yes, they do not cause such huge drops in precision (the exponential is the exception, though, and it does lead to numerical issues).
Is the GPU avoiding floats? No, it is not; GPUs have nothing to do with it.
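To illustrate the exponential exception, here is a small sketch of a softmax over large logits in plain Python. The function names and the example logits are hypothetical, and subtracting the maximum before exponentiating is the usual workaround rather than anything specific to this answer. Note that `math.exp` raises `OverflowError` on overflow, whereas array libraries typically return inf and then nan after the division, which is the failure mode described above.

```python
import math


def softmax_naive(logits):
    """Softmax computed directly; exp overflows for large logits."""
    exps = [math.exp(z) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def softmax_stable(logits):
    """Softmax with the maximum subtracted first, so every exponent is <= 0."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


logits = [1000.0, 1001.0, 1002.0]  # illustrative values only

try:
    softmax_naive(logits)
    naive_failed = False
except OverflowError:
    # math.exp(1000.0) exceeds the largest representable float64.
    naive_failed = True

probs = softmax_stable(logits)
print("naive version overflowed:", naive_failed)
print("stable probabilities:", probs)
```

Both versions compute the same function mathematically, since the shift by the maximum cancels in the ratio. Only the stable one keeps the intermediate values in range.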
