
I'd like to know the possible ways to implement batch normalization layers that synchronize batch statistics when training with multiple GPUs.

Caffe Maybe there are some variants of Caffe that could do it, like link. But for the BN layer, my understanding is that it still synchronizes only the outputs of layers, not the means and variances. Maybe MPI can synchronize the means and variances, but I think MPI is a little difficult to implement.

Torch I've seen some comments here and here, which show that running_mean and running_var can be synchronized, but I think the batch mean and batch variance cannot be, or are difficult to, synchronize.

Tensorflow Normally, it is the same as Caffe and Torch. The implementation of BN refers to this. I know TensorFlow can place an operation on any device specified by tf.device(). But the computation of the means and variances is in the middle of the BN layer, so if I gather the means and variances on the CPU, my code will look like this:

# Per-tower outputs of the first conv block, gathered across GPUs.
block1_gather = []
label_batches = []
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block1', reuse=i > 0):
            image_batch, label_batch = cifar_input.build_input(
                'cifar10', train_data_path, batch_size, 'train')
            label_batches.append(label_batch)

            x = _conv('weights', image_batch, 3, 3, 16, _stride_arr(1))
            block1_gather.append(x)

# Gather all tower outputs on the CPU and compute the global batch moments.
with tf.device('/cpu:0'):
    print(block1_gather[0].get_shape())
    x1 = tf.concat(block1_gather, 0)
    # print(x1.get_shape())
    mean, variance = tf.nn.moments(x1, [0, 1, 2], name='moments')

# Send the shared moments back to every GPU and normalize each tower's slice.
for i in range(num_gpu):
    with tf.device('/gpu:%d' % i):
        with tf.variable_scope('block2', reuse=i > 0):
            shape = block1_gather[i].get_shape().as_list()
            assert len(shape) in [2, 4]
            n_out = shape[-1]
            beta, gamma, moving_mean, moving_var = get_bn_variables(n_out, True, True)

            x = tf.nn.batch_normalization(
                block1_gather[i], mean, variance, beta, gamma, 0.00001)

            x = _relu(x)

That is just for one BN layer. To gather the statistics on the CPU, I have to break up the code like this. If I have more than 100 BN layers, that will be cumbersome.
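The gathering could of course be factored into a helper along these lines (a rough sketch only, assuming TF 1.x graph mode; sync_batch_norm and its arguments are hypothetical names), but every layer still has to route its activations through this cross-GPU step:

def sync_batch_norm(tower_inputs, beta, gamma, eps=1e-5):
    """Normalize each tower's activations with moments computed over the
    concatenation of all towers (hypothetical helper, TF 1.x graph mode)."""
    with tf.device('/cpu:0'):
        # Concatenate along the batch axis and compute the global moments.
        all_x = tf.concat(tower_inputs, 0)
        mean, variance = tf.nn.moments(all_x, [0, 1, 2], name='moments')
    outputs = []
    for i, x in enumerate(tower_inputs):
        with tf.device('/gpu:%d' % i):
            # Normalize each tower's slice with the shared moments.
            outputs.append(tf.nn.batch_normalization(
                x, mean, variance, beta, gamma, eps))
    return outputs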

I am not an expert in these libraries, so there may be some misunderstandings; feel free to point out my errors.

I do not care much about training speed. I am doing image segmentation, which consumes a lot of GPU memory, and BN needs a reasonably large batch size (e.g. larger than 16) for stable statistics, so using multiple GPUs is inevitable. In my opinion, TensorFlow might be the best choice, but I can't resolve the code-breaking problem. Solutions with other libraries would be welcome too.

Paolo Forgia
LI Xuhong

3 Answers

3

I'm not sure if I fully understand your question, but provided you set up your variable scope properly, the tf.GraphKeys.UPDATE_OPS collection should automatically have the update ops for batch_norm for each of your towers. If all of the update_ops are applied synchronously, they will be implicitly averaged by the parameter server; all you have to do is make sure the updates are applied before you average and apply gradients (if I understand your intentions correctly).

Because of variable scoping, each set of update ops will update the same variables, so to synchronize the update ops all you need to do is gate your gradient calculation on the complete set of update ops. You should also encapsulate all of your batch norm layers in a single name_scope to avoid grabbing any extraneous ops in UPDATE_OPS. Code skeleton below:

update_ops = []
for i, device in enumerate(devices):
  with tf.variable_scope('foo', reuse=bool(i > 0)):
    with tf.name_scope('tower_%d' % i) as name_scope:
      with tf.device(device):
        # Put as many batch_norm layers as you want here
        pass
      # Collect only this tower's update ops by filtering on its name_scope.
      update_ops.extend(tf.get_collection(tf.GraphKeys.UPDATE_OPS,
                                          name_scope))
# make gradient calculation ops here
with tf.device(averaging_device):
  with tf.control_dependencies(update_ops):
    # average and apply gradients.
    pass
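To make those two placeholder comments concrete, a rough TF 1.x sketch of the averaging step could look like this; tower_losses and the choice of optimizer are assumptions of the sketch, not part of the skeleton above:

# Assumes each tower appended its loss to tower_losses inside the loop above.
opt = tf.train.MomentumOptimizer(learning_rate=0.1, momentum=0.9)
tower_grads = [opt.compute_gradients(loss) for loss in tower_losses]

with tf.device(averaging_device):
    averaged_grads = []
    for grads_and_vars in zip(*tower_grads):
        # Average each variable's gradient over all towers.
        grads = [g for g, _ in grads_and_vars if g is not None]
        if not grads:
            continue
        var = grads_and_vars[0][1]
        averaged_grads.append((tf.reduce_mean(tf.stack(grads), axis=0), var))
    # Gate the weight update on every tower's BN moving-average updates.
    with tf.control_dependencies(update_ops):
        train_op = opt.apply_gradients(averaged_grads)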

If you want to try this on some existing code, try just deleting the if i == 0 line here: https://github.com/tensorflow/models/blob/master/tutorials/image/cifar10_estimator/cifar10_main.py#L115

You're going to see some slowdown (we usually only use one tower to compute batch norm statistics for this reason), but it should do what you want.

Eli Bixby
  • Thanks Eli Bixby. Did you succeed in training BN with multiple GPUs? Please look at my question and give me some comments: https://stackoverflow.com/questions/48150720/how-to-update-variable-of-batchnorm-in-multiple-gpus-in-tensorflow – Jame Jan 08 '18 at 14:09
  • Thanks @Eli Bixby for your answer, but sorry, there might be some mistakes. For BN, it is not only the `statistics` that should be accumulated or updated; the gradients back-propagated at each step should also be considered. If a small batch size is used, the gradients are not stable at all. What you propose here only covers the forward pass; the backward pass is computed by auto-differentiation and is not done correctly. – LI Xuhong Jan 11 '18 at 16:04
  • You should be using a large enough batch size that each GPU is saturated. In our experience this is large enough, even when sharded between GPUs, to produce stable gradients for batch_norm by only propagating gradients from a single GPU shard (see the code I linked). Looking into it, AFAICT there's no way to average the batch norm gradients before application with the current high-level batch norm functions; you'd have to implement that yourself in low-level TF. – Eli Bixby Jan 16 '18 at 22:40
  • @Eli Bixby Thanks. I work on segmentation tasks that usually consume a lot of memory, so I cannot use a large batch size. Do you have any experience implementing this in low-level TF, or some concrete examples? Ideally including the implementation of computing the gradients. – LI Xuhong Jan 23 '18 at 16:51
  • I don't have any experience trying to do this specific task manually, but it should follow the exact same pattern as manually averaging gradients, except that instead of averaging gradients you're averaging the batch norm statistics in order to calculate a gradient. So you'll need to create a number of batch_norm ops manually pegged to a device, then use the list of ops to extract averaged statistics. This will be analogous to the code I linked above (which does it for gradients). Sorry I couldn't be more helpful! I might try this myself when I get some time. – Eli Bixby Jan 24 '18 at 22:17
2

A specialized Keras layer, SyncBatchNormalization, is available since TF 2.2: https://www.tensorflow.org/api_docs/python/tf/keras/layers/experimental/SyncBatchNormalization
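A minimal usage sketch (not part of the answer above; the model and input shapes are placeholders) wraps model construction in a MirroredStrategy scope so each replica runs on one GPU and the layer aggregates batch statistics across all of them:

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()  # one replica per visible GPU
with strategy.scope():
    model = tf.keras.Sequential([
        tf.keras.layers.Conv2D(16, 3, padding='same', input_shape=(32, 32, 3)),
        # Batch statistics are aggregated across all replicas at each step.
        tf.keras.layers.experimental.SyncBatchNormalization(),
        tf.keras.layers.ReLU(),
        tf.keras.layers.GlobalAveragePooling2D(),
        tf.keras.layers.Dense(10),
    ])
    model.compile(
        optimizer='adam',
        loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))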

0

I've figured out a way to implement sync batch norm in pure TensorFlow and pure Python.

The code makes it possible to train PSPNet on Cityscapes and get comparable performance.
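A rough sketch of the general idea (not the actual code of that implementation; the helper name is hypothetical) is to average each GPU's local moments before normalizing, which avoids moving the activations themselves off the GPUs:

def sync_moments(tower_xs, axes=[0, 1, 2]):
    """Compute shared mean/variance from per-GPU activations.
    Rough sketch; assumes every GPU sees the same per-tower batch size."""
    means, sq_means = [], []
    for i, x in enumerate(tower_xs):
        with tf.device('/gpu:%d' % i):
            # Local first and second moments, computed where the data lives.
            means.append(tf.reduce_mean(x, axes))
            sq_means.append(tf.reduce_mean(tf.square(x), axes))
    # Average the local moments; E[x^2] - E[x]^2 gives the global variance.
    mean = tf.add_n(means) / len(means)
    variance = tf.add_n(sq_means) / len(sq_means) - tf.square(mean)
    return mean, variance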

LI Xuhong