Minimal example to demonstrate the problem:
import tensorflow as tf
with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))
Output on a 4-GPU server:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=0.0>,
2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=0.0>,
3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=0.0>
}
The problem, as seen above, is that the replicas do not all hold the correct variable value: every replica except the one on the first device is zero (the numpy=0.0 parts). The same thing happens with 2 or 3 devices, not just with all 4.
The same code does produce the expected behavior on a different machine.
Correct output on a different, 2-GPU workstation:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
MirroredVariable:{
0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}
(Note the value 1.0 on both devices)
The problematic machine is a Dell PowerEdge R750xa with 4x Nvidia A40 GPUs.
The correctly working machine has 2x Titan RTX.
Software config on both:
- Ubuntu 18.04
- CUDA 11.4
- cuDNN 8.2.4
- TensorFlow 2.6.0
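One experiment that might narrow this down (a hedged sketch, not something I have verified on the failing machine): MirroredStrategy defaults to NCCL-based cross-device communication, and the initial broadcast of the variable to the other replicas is what appears to fail here. The `cross_device_ops` argument of `tf.distribute.MirroredStrategy` accepts the documented alternative `tf.distribute.HierarchicalCopyAllReduce`, which routes copies through host memory instead of GPU peer-to-peer transfers:

```python
import tensorflow as tf

# Sketch: swap the default NCCL-based cross-device ops for
# HierarchicalCopyAllReduce, which stages copies through the host.
# If the replicas then show numpy=1.0, that would point at the
# GPU peer-to-peer / NCCL broadcast path on this machine.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    v = tf.Variable(1.)

# Inspect each replica's component of the mirrored variable.
for component in strategy.experimental_local_results(v):
    print(component)
```

Would a result like that (correct values with non-NCCL cross-device ops) be a meaningful data point, or is the default path expected to work on this hardware?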
What could cause this behavior? I'm glad to provide more details.