Minimal example to demonstrate the problem:
import tensorflow as tf
with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))
Output on a 4-GPU server:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=0.0>,
2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=0.0>,
3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=0.0>
}
The problem, as seen above, is that the replicas do not all hold the correct variable value: every replica except the one on the first device is zero (the numpy=0.0 parts). The same thing happens with 2 or 3 devices, not just with all 4.
The same code does produce the expected behavior on a different machine.
Correct output on a different, 2-GPU workstation:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
(Note the value 1.0 on both devices)
The problematic machine is a Dell PowerEdge R750xa with 4x Nvidia A40 GPUs.
The correctly working machine has 2x Titan RTX.
Software config on both:
- Ubuntu 18.04
- CUDA 11.4
- cuDNN 8.2.4
- TensorFlow 2.6.0
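One experiment that might narrow this down (a hedged sketch, not something I have verified on the failing machine): MirroredStrategy defaults to NCCL-based cross-device communication, and the initial broadcast of the variable to the other replicas is what appears to fail here. The `cross_device_ops` argument of `tf.distribute.MirroredStrategy` accepts the documented alternative `tf.distribute.HierarchicalCopyAllReduce`, which routes copies through host memory instead of GPU peer-to-peer transfers:

```python
import tensorflow as tf

# Sketch: swap the default NCCL-based cross-device ops for
# HierarchicalCopyAllReduce, which stages copies through the host.
# If the replicas then show numpy=1.0, that would point at the
# GPU peer-to-peer / NCCL broadcast path on this machine.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    v = tf.Variable(1.)

# Inspect each replica's component of the mirrored variable.
for component in strategy.experimental_local_results(v):
    print(component)
```

Would a result like that (correct values with non-NCCL cross-device ops) be a meaningful data point, or is the default path expected to work on this hardware?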
What could cause this behavior? I'm glad to provide more details.