
Minimal example to demonstrate the problem:

import tensorflow as tf

with tf.distribute.MirroredStrategy().scope():
    print(tf.Variable(1.))

Output on a 4-GPU server:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1', '/job:localhost/replica:0/task:0/device:GPU:2', '/job:localhost/replica:0/task:0/device:GPU:3')
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=0.0>,
  2: <tf.Variable 'Variable/replica_2:0' shape=() dtype=float32, numpy=0.0>,
  3: <tf.Variable 'Variable/replica_3:0' shape=() dtype=float32, numpy=0.0>
}

The problem, as seen above, is that the replicas do not hold the correct variable value: every replica except the one on the first device is zero (the numpy=0.0 parts). The same happens with 2 or 3 devices, not just with all 4.
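For reference, the mismatch can also be checked programmatically rather than by reading the printed repr; a minimal sketch, relying on the `values` attribute of a `MirroredVariable`, which exposes the per-replica components:

```python
import tensorflow as tf

strategy = tf.distribute.MirroredStrategy()
with strategy.scope():
    v = tf.Variable(1.)

# Every entry should be 1.0; on the problematic machine only the
# first replica's component holds the initial value.
per_replica_values = [component.numpy() for component in v.values]
print(per_replica_values)
```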

The same code produces the expected behavior on a different machine.

Correct output on a different, 2-GPU workstation:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
MirroredVariable:{
  0: <tf.Variable 'Variable:0' shape=() dtype=float32, numpy=1.0>,
  1: <tf.Variable 'Variable/replica_1:0' shape=() dtype=float32, numpy=1.0>
}

(Note the value 1.0 on both devices)


The problematic machine is a Dell PowerEdge R750xa with 4x Nvidia A40 GPUs.

The correctly working machine has 2x Titan RTX.

Software config on both:

  • Ubuntu 18.04
  • CUDA 11.4
  • cuDNN 8.2.4
  • TensorFlow 2.6.0

What could be the reason for such behavior? Glad to provide more details.
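In case it helps narrow things down: by default MirroredStrategy uses NCCL for cross-device communication, so one hypothesis (an assumption, not something verified here) is that the initial broadcast fails silently on this hardware. A sketch that swaps in a non-NCCL implementation to test that, using `tf.distribute.HierarchicalCopyAllReduce` from the public API:

```python
import tensorflow as tf

# Replace the default NCCL-based cross-device ops to check whether the
# broadcast of the initial value is where the zeros come from.
strategy = tf.distribute.MirroredStrategy(
    cross_device_ops=tf.distribute.HierarchicalCopyAllReduce())
with strategy.scope():
    v = tf.Variable(1.)
print(v)
```

If the replicas come out as 1.0 with this variant, that would point at the NCCL setup on the A40 machine rather than at TensorFlow itself.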
