
I get this warning when running a Keras multiple-input model built with the functional API. The model works fine and without warnings when running on a single GPU. When I use tf.distribute.MirroredStrategy to train on two GPUs, the final result of the model is correct, but I get the warning below. I suspect this can lead to performance issues.

tf.__version__ : 2.2.0
tf.keras.__version__ : 2.3.0-tf
NVIDIA-SMI 410.72       Driver Version: 410.72       CUDA Version: 10.1

The model I am generating is:

import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Conv1D, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model


def build_model_():

    input_a_size = 200
    input_b_size = 4
    num_classes = 2
    len_embedding = 100

    mirrored_strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])

    with mirrored_strategy.scope():

        input_a = Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
        input_b = Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

        # Branch A: embedding + convolution over the integer-encoded input
        x = Embedding(len_embedding, 100)(input_a)
        x = Conv1D(32, 4, activation='relu')(x)
        x = Flatten()(x)
        branch_a = Dense(64, activation='relu')(x)

        # Branch B: dense layers over the float input
        x = Dense(32, activation='relu')(input_b)
        branch_b = Dense(32, activation='relu')(x)

        concat = Concatenate()([
            branch_a,
            branch_b,
        ])

        x = Dense(256, activation='relu')(concat)
        output = Dense(num_classes, activation='softmax')(x)

        model = Model(inputs=[
                          input_a,
                          input_b,
                      ],
                      outputs=[output])

        model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

    model.summary()

    return model

Model summary:

INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to                     
==================================================================================================
input_a (InputLayer)            [(None, 200)]        0                                            
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 100)     10000       input_a[0][0]                    
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 197, 128)     51328       embedding[0][0]                  
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 49, 128)      0           conv1d[0][0]                     
__________________________________________________________________________________________________
input_b (InputLayer)            [(None, 4)]          0                                            
__________________________________________________________________________________________________
flatten (Flatten)               (None, 6272)         0           max_pooling1d[0][0]              
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 32)           160         input_b[0][0]                    
__________________________________________________________________________________________________
dense (Dense)                   (None, 64)           401472      flatten[0][0]                    
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 32)           1056        dense_1[0][0]                    
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 96)           0           dense[0][0]                      
                                                                 dense_2[0][0]                    
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 256)          24832       concatenate[0][0]                
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 2)            514         dense_3[0][0]                    
==================================================================================================
Total params: 489,362
Trainable params: 489,362
Non-trainable params: 0
__________________________________________________________________________________________________

The way I am generating the inputs:

input_a_train.shape: (35000, 200)
input_b_train.shape: (35000, 4)
y_train.shape: (35000, 2)

train_dataset = tf.data.Dataset.from_tensor_slices(({
                                                     "input_a": input_a_train,
                                                     "input_b": input_b_train,
                                                     }, y_train))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>

val_dataset = tf.data.Dataset.from_tensor_slices(({
                                                   "input_a": input_a_val,
                                                   "input_b": input_b_val,
                                                   }, y_val))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>

train_batches = train_dataset.padded_batch(128)
val_batches = val_dataset.padded_batch(128)

I get the warning during the training phase:

history = my_model.fit(
    x = train_batches,
    epochs=3,
    verbose = 1,
    validation_data = val_batches,
)

Here is the output:

Epoch 1/3
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
274/274 [==============================] - ETA: 0s - loss: 0.1857 - accuracy: 0.9324
...

There is a similar question here: "Efficient allreduce is not supported for 2 IndexedSlices", but it has no answers.

EDIT 1 (31/7/2020)

I implemented a custom training loop as described in these guides:

https://www.tensorflow.org/tutorials/distribute/custom_training

https://www.tensorflow.org/guide/distributed_training#using_tfdistributestrategy_with_custom_training_loops

https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough

https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch

Same warning, and same behaviour on multiple GPUs. Performance decreases as I increase the number of GPUs: training on 1 GPU is faster than training on 2 GPUs, and the worst case is 8 GPUs. I thought the problem could be in the keras Model.fit method, but it is not. My guess is that the problem is in the input data format for a multi-input Keras functional API model.
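
For reference, this is roughly the kind of loop I tried (a minimal sketch following the linked guides; it assumes build_model_ only defines the layers, without its own strategy or compile, and names such as dist_train_batches are illustrative, not my exact code):

import tensorflow as tf

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])

GLOBAL_BATCH_SIZE = 128 * strategy.num_replicas_in_sync
train_batches = train_dataset.batch(GLOBAL_BATCH_SIZE)
dist_train_batches = strategy.experimental_distribute_dataset(train_batches)

with strategy.scope():
    model = build_model_()  # assumed to only build the layers, no compile / no inner strategy
    loss_fn = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)
    optimizer = tf.keras.optimizers.Adam()

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        # Scale per-example losses by the global batch size, as the guide recommends
        per_example_loss = loss_fn(labels, predictions)
        loss = tf.nn.compute_average_loss(per_example_loss,
                                          global_batch_size=GLOBAL_BATCH_SIZE)
    gradients = tape.gradient(loss, model.trainable_variables)
    optimizer.apply_gradients(zip(gradients, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM, per_replica_losses, axis=None)

for epoch in range(3):
    total_loss, num_batches = 0.0, 0
    for batch in dist_train_batches:
        total_loss += distributed_train_step(batch)
        num_batches += 1
    print(f"epoch {epoch}: loss {total_loss / num_batches:.4f}")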

1 Answer


I solved it by using tf.distribute.experimental.MultiWorkerMirroredStrategy(). This strategy has better support for handling IndexedSlices, as explained here: https://github.com/tensorflow/tensorflow/issues/41898#issuecomment-668786507. I also increased the batch size according to the number of GPUs used. This issue covers the problem: https://github.com/tensorflow/tensorflow/issues/41898

import tensorflow as tf

physical_devices = tf.config.list_physical_devices('GPU')  # 8 GPUs in my setup
tf.config.set_visible_devices(physical_devices[0:8], 'GPU')  # using all GPUs (default behaviour)
strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy()

BATCH_SIZE_PER_REPLICA = 1024
GLOBAL_BATCH_SIZE = BATCH_SIZE_PER_REPLICA * strategy.num_replicas_in_sync

train_batches = train_dataset.batch(GLOBAL_BATCH_SIZE)

with strategy.scope():
    # build_model_ is assumed to only define the layers (no strategy or compile inside)
    model = build_model_()
    model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])

history = model.fit(
    x=train_batches,
    epochs=10,
    verbose=1,
)
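
For anyone wondering where the IndexedSlices in the warning come from: the gradient of an embedding lookup is returned as a sparse set of row updates (tf.IndexedSlices) rather than a dense tensor, and that is what the allreduce cannot pack efficiently. A tiny standalone snippet (illustrative only, not part of the training code above) shows this:

import tensorflow as tf

# The gradient w.r.t. the embedding table is a tf.IndexedSlices object
# (row indices plus the updated rows), not a dense tensor.
emb = tf.keras.layers.Embedding(input_dim=100, output_dim=8)
ids = tf.constant([[1, 2, 3]])

with tf.GradientTape() as tape:
    out = emb(ids)
    loss = tf.reduce_sum(out)

grads = tape.gradient(loss, emb.trainable_variables)
print(type(grads[0]))  # tf.IndexedSlices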

  • what exactly does IndexedSlices mean? can you please explain this in a little detail? – n0obcoder Nov 12 '20 at 09:11
  • Hi @Santiago! To make the best use of multi GPUs, I should replace strategy = tf.distribute.MirroredStrategy() with strategy = tf.distribute.experimental.MultiWorkerMirroredStrategy() in my code? That worked for you without changing or adding any other parameters? – n0obcoder Nov 12 '20 at 09:20
  • thanks a lot! it helped us on AWS EC2 instance with 8 V100 GPUs. For 4 GPUs MirroredStrategy worked fine, but for 8 MultiWorkerMirroredStrategy became a real lifesaver! – Anatoly Alekseev Dec 01 '20 at 17:03
  • https://www.tensorflow.org/guide/distributed_training#mirroredstrategy According to the documentation, MirroredStrategy is to be used for 1 node aka 1 worker, and MultiWorkerMirroredStrategy for multi-node/multi-worker. @AnatolyAlekseev in your case were the 8 V100 GPUs on 1 node or multi-node? Which EC2 instance was it? If it was 1 instance, using either of the 2 strategies should give the same result, right? – Chaitanya Bapat Jun 03 '21 at 03:28
  • Never mind, I read the linked GitHub issue: https://github.com/tensorflow/tensorflow/issues/41898 which talks about the shortcoming. Thanks – Chaitanya Bapat Jun 03 '21 at 03:31
  • it's a cool explanation. I don't have technical knowledge about this, but I am implementing it on an Nvidia K80 card. When I run the code the kernel automatically dies? Any solution? – Engr Ali Mar 16 '22 at 10:10