According to He et al. (2015), Eq. 15, the theoretical weight variance for one layer when using ReLU is:
n * Var[W] = 2
where n is the layer size.
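As a minimal NumPy sketch of that rule (the layer sizes 256 and 128 are assumptions for illustration), you draw the weights from a zero-mean Gaussian with Var[W] = 2/n:

import numpy as np

n = 256                                         # layer size (fan-in), assumed for illustration
std = np.sqrt(2.0 / n)                          # so that n * Var[W] = 2
W = np.random.normal(0.0, std, size=(n, 128))   # 128 output units, also assumed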
If you instead want to pool the variances of the in layer and the out layer, the bounds become:
(fan_in, fan_out) = ...
low = -2*np.sqrt(1.0/(fan_in + fan_out))
high = 2*np.sqrt(1.0/(fan_in + fan_out))
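As a rough usage sketch (the concrete fan_in/fan_out values here are an assumption), you can then sample the weight matrix uniformly within those bounds:

import numpy as np

(fan_in, fan_out) = (256, 128)                   # assumed sizes, for illustration only
low = -2 * np.sqrt(1.0 / (fan_in + fan_out))
high = 2 * np.sqrt(1.0 / (fan_in + fan_out))
W = np.random.uniform(low, high, size=(fan_in, fan_out))  # draw weights within the bounds above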
If you are using TensorFlow, it provides a variance_scaling_initializer, where you can set the factor and mode arguments to control how the weights are initialized.
With the default factor=2.0, you get the initialization variance suggested by He et al. (2015) for ReLU activations, and you can change the mode argument to get slightly different weight initialization variances. Only use in layer:
tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_IN')
would give you the following:
(fan_in, fan_out) = ...
low = -np.sqrt(2.0/fan_in)
high = np.sqrt(2.0/fan_in)
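For instance, a minimal TF 1.x usage sketch (the variable name and shape are assumptions), passing this initializer to tf.get_variable:

import tensorflow as tf

he_init = tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_IN')
W = tf.get_variable('W', shape=[256, 128], initializer=he_init)  # shape assumed for illustration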
Use both in and out layers:
tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_AVG')
would give you:
(fan_in, fan_out) = ...
low = -np.sqrt(4.0/(fan_in+fan_out)) = -2.0*np.sqrt(1.0/(fan_in+fan_out))
high = np.sqrt(4.0/(fan_in+fan_out)) = 2.0*np.sqrt(1.0/(fan_in+fan_out))
Only use out layer:
tf.contrib.layers.variance_scaling_initializer(factor=2.0, mode='FAN_OUT')
would give you:
(fan_in, fan_out) = ...
low = -np.sqrt(2.0/fan_out)
high = np.sqrt(2.0/fan_out)