I get this warning when running a Keras multiple-input model built with the functional API. The model works fine and without warnings when running on a single GPU. When I use tf.distribute.MirroredStrategy to run on two GPUs, the final result of the model is still correct, but I get the warning. Can this lead to performance issues?
tf.__version__ : 2.2.0
tf.keras.__version__ : 2.3.0-tf
NVIDIA-SMI 410.72 Driver Version: 410.72 CUDA Version: 10.1
The model I am generating is:
import numpy as np
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Conv1D, Flatten, Dense, Concatenate
from tensorflow.keras.models import Model

def build_model_():
    input_a_size = 200
    input_b_size = 4
    num_classes = 2
    len_embedding = 100

    mirrored_strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
    with mirrored_strategy.scope():
        input_a = Input(shape=(input_a_size,), name='input_a', dtype=np.uint8)
        input_b = Input(shape=(input_b_size,), name='input_b', dtype=np.float32)

        # branch A: embedding + convolution over input_a
        x = Embedding(len_embedding, 100)(input_a)
        x = Conv1D(32, 4, activation='relu')(x)
        x = Flatten()(x)
        branch_a = Dense(64, activation='relu')(x)

        # branch B: small dense stack over input_b
        x = Dense(32, activation='relu')(input_b)
        branch_b = Dense(32, activation='relu')(x)

        concat = Concatenate()([branch_a, branch_b])
        x = Dense(256, activation='relu')(concat)
        output = Dense(num_classes, activation='softmax')(x)

        model = Model(inputs=[input_a, input_b], outputs=[output])
        model.compile(loss='binary_crossentropy',
                      optimizer='adam',
                      metrics=['accuracy'])
    model.summary()
    return model
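For what it's worth, my understanding is that the Embedding layer is the only layer here whose gradient is a tf.IndexedSlices rather than a dense tensor, which seems to be what the warning refers to. A minimal stand-alone check of that assumption, outside any distribution strategy:

import tensorflow as tf

# The gradient of an embedding lookup w.r.t. the embedding matrix is sparse.
emb = tf.keras.layers.Embedding(100, 100)
with tf.GradientTape() as tape:
    out = emb(tf.constant([[1, 2, 3]]))
    loss = tf.reduce_sum(out)
grad = tape.gradient(loss, emb.trainable_variables)[0]
print(type(grad))  # IndexedSlices, not a dense tf.Tensor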
Model summary:
INFO:tensorflow:Using MirroredStrategy with devices ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1')
Model: "model"
__________________________________________________________________________________________________
Layer (type)                    Output Shape         Param #     Connected to
==================================================================================================
input_a (InputLayer)            [(None, 200)]        0
__________________________________________________________________________________________________
embedding (Embedding)           (None, 200, 100)     10000       input_a[0][0]
__________________________________________________________________________________________________
conv1d (Conv1D)                 (None, 197, 128)     51328       embedding[0][0]
__________________________________________________________________________________________________
max_pooling1d (MaxPooling1D)    (None, 49, 128)      0           conv1d[0][0]
__________________________________________________________________________________________________
input_b (InputLayer)            [(None, 4)]          0
__________________________________________________________________________________________________
flatten (Flatten)               (None, 6272)         0           max_pooling1d[0][0]
__________________________________________________________________________________________________
dense_1 (Dense)                 (None, 32)           160         input_b[0][0]
__________________________________________________________________________________________________
dense (Dense)                   (None, 64)           401472      flatten[0][0]
__________________________________________________________________________________________________
dense_2 (Dense)                 (None, 32)           1056        dense_1[0][0]
__________________________________________________________________________________________________
concatenate (Concatenate)       (None, 96)           0           dense[0][0]
                                                                 dense_2[0][0]
__________________________________________________________________________________________________
dense_3 (Dense)                 (None, 256)          24832       concatenate[0][0]
__________________________________________________________________________________________________
dense_4 (Dense)                 (None, 2)            514         dense_3[0][0]
==================================================================================================
Total params: 489,362
Trainable params: 489,362
Non-trainable params: 0
__________________________________________________________________________________________________
The way I am generating the inputs:
input_a_train.shape: (35000, 200)
input_b_train.shape: (35000, 4)
y_train.shape: (35000, 2)
train_dataset = tf.data.Dataset.from_tensor_slices(({
    "input_a": input_a_train,
    "input_b": input_b_train,
}, y_train))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>
val_dataset = tf.data.Dataset.from_tensor_slices(({
    "input_a": input_a_val,
    "input_b": input_b_val,
}, y_val))
<TensorSliceDataset shapes: ({input_a: (200,), input_b: (4,)}, (2,)), types: ({input_a: tf.uint8, input_b: tf.float64}, tf.float32)>
train_batches = train_dataset.padded_batch(128)
val_batches = val_dataset.padded_batch(128)
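Side note: input_b comes out of from_tensor_slices as tf.float64, while the corresponding Input layer is declared as float32. I have not confirmed whether this matters for the warning, but a variant of the two lines above that casts it explicitly would look like this:

def cast_inputs(features, label):
    # Cast the second input to float32 so it matches Input(..., dtype=np.float32).
    features = dict(features)
    features["input_b"] = tf.cast(features["input_b"], tf.float32)
    return features, label

train_batches = train_dataset.map(cast_inputs).padded_batch(128)
val_batches = val_dataset.map(cast_inputs).padded_batch(128)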
I get the warning during the training phase:
history = my_model.fit(
x = train_batches,
epochs=3,
verbose = 1,
validation_data = val_batches,
)
Here is the output:
Epoch 1/3
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:batch_all_reduce: 12 all-reduces with algorithm = nccl, num_packs = 1
WARNING:tensorflow:Efficient allreduce is not supported for 1 IndexedSlices
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:GPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:GPU:0', '/job:localhost/replica:0/task:0/device:GPU:1').
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
INFO:tensorflow:Reduce to /job:localhost/replica:0/task:0/device:CPU:0 then broadcast to ('/job:localhost/replica:0/task:0/device:CPU:0',).
274/274 [==============================] - ETA: 0s - loss: 0.1857 - accuracy: 0.9324
...
There is a similar question here: Efficient allreduce is not supported for 2 IndexedSlices, but it has no answers.
EDIT 1 (31/7/2020)
I implemented a custom training loop (sketched below) as described in these guides:
https://www.tensorflow.org/tutorials/distribute/custom_training
https://www.tensorflow.org/tutorials/customization/custom_training_walkthrough
https://www.tensorflow.org/guide/keras/writing_a_training_loop_from_scratch
Same warning, and same behaviour on multiple GPUs. Performance decreases as I increase the number of GPUs: training on 1 GPU is faster than training on 2 GPUs, with the worst case being 8 GPUs. So I thought the problem could be in the keras model.fit method, but it is not. My guess is that the problem lies in the input data format for a multi-input Keras functional API model.
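The loop I implemented follows the distributed custom training tutorial closely; here is a condensed sketch (build_model_core is a stand-in for the same architecture as above, built without its own strategy or compile; the global batch size of 128 matches the padded_batch above):

strategy = tf.distribute.MirroredStrategy(['/gpu:0', '/gpu:1'])
GLOBAL_BATCH_SIZE = 128

with strategy.scope():
    model = build_model_core()  # same layers as build_model_, no compile
    optimizer = tf.keras.optimizers.Adam()
    loss_object = tf.keras.losses.BinaryCrossentropy(
        reduction=tf.keras.losses.Reduction.NONE)

dist_train_batches = strategy.experimental_distribute_dataset(train_batches)

def train_step(inputs):
    features, labels = inputs
    with tf.GradientTape() as tape:
        predictions = model(features, training=True)
        per_example_loss = loss_object(labels, predictions)
        loss = tf.nn.compute_average_loss(
            per_example_loss, global_batch_size=GLOBAL_BATCH_SIZE)
    grads = tape.gradient(loss, model.trainable_variables)
    # The cross-replica gradient aggregation (where the IndexedSlices
    # warning shows up) happens inside apply_gradients.
    optimizer.apply_gradients(zip(grads, model.trainable_variables))
    return loss

@tf.function
def distributed_train_step(inputs):
    per_replica_losses = strategy.run(train_step, args=(inputs,))
    return strategy.reduce(tf.distribute.ReduceOp.SUM,
                           per_replica_losses, axis=None)

for epoch in range(3):
    total_loss, num_batches = 0.0, 0
    for batch in dist_train_batches:
        total_loss += distributed_train_step(batch)
        num_batches += 1
    print('epoch', epoch, 'mean loss', float(total_loss) / num_batches)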