
How can I create custom tf.keras models on TensorFlow 2.x that support distributed training (multiple GPU instances) in Amazon SageMaker?

For example, can this be done with the Distributed Data Parallel Library (DDPL)?

The documentation mentions that tf.keras is not supported by the DDPL library, so that does not appear to be an option. I have seen examples of distributed training using Horovod instead: https://sagemaker-examples.readthedocs.io/en/latest/aws_sagemaker_studio/frameworks/keras_pipe_mode_horovod/keras_pipe_mode_horovod_cifar10.html
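
For context, here is a rough sketch of how I understand the Horovod route would be configured. It only builds the keyword arguments I would expect to pass to the SageMaker Python SDK's `sagemaker.tensorflow.TensorFlow` estimator, with MPI enabled through the `distribution` argument. The helper name `horovod_estimator_kwargs`, the script name `train.py`, the role ARN, the instance type, and the framework/Python versions are all placeholders I picked for illustration, not values taken from the linked example.

```python
# Hypothetical helper: collects keyword arguments for
# sagemaker.tensorflow.TensorFlow so that the training job is launched via MPI,
# which is how Horovod jobs are started on SageMaker.
def horovod_estimator_kwargs(entry_point, role, instance_count=2,
                             instance_type="ml.p3.8xlarge",
                             gpus_per_instance=4):
    return {
        "entry_point": entry_point,        # training script (assumed name)
        "role": role,                      # IAM role ARN (placeholder)
        "instance_count": instance_count,  # number of GPU instances
        "instance_type": instance_type,    # assumed instance type with 4 GPUs
        "framework_version": "2.4.1",      # assumed TensorFlow version
        "py_version": "py37",              # assumed Python version
        # One MPI process per GPU on every instance.
        "distribution": {
            "mpi": {
                "enabled": True,
                "processes_per_host": gpus_per_instance,
            }
        },
    }


kwargs = horovod_estimator_kwargs(
    "train.py", "arn:aws:iam::123456789012:role/ExampleRole"
)

# With the SageMaker SDK installed, the job would then be started roughly as:
#   from sagemaker.tensorflow import TensorFlow
#   estimator = TensorFlow(**kwargs)
#   estimator.fit("s3://<bucket>/<prefix>")   # placeholder S3 location
```

As far as I can tell, the training script itself would still need the usual Horovod changes for tf.keras (calling `horovod.tensorflow.keras.init()`, wrapping the optimizer in `DistributedOptimizer`, and broadcasting the initial variables from rank 0), so I am not sure whether this is the recommended approach for custom tf.keras models.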

juvchan

0 Answers