
Running AWS SageMaker with a custom model, the TrainingJob fails with an Algorithm Error when using Keras with a TensorFlow backend in a multi-GPU configuration:

from keras.utils import multi_gpu_model

parallel_model = multi_gpu_model(model, gpus=K)
parallel_model.compile(loss='categorical_crossentropy',
                       optimizer='rmsprop')
parallel_model.fit(x, y, epochs=20, batch_size=256)

This simple parallel model loading fails. There is no further error or exception in the CloudWatch logs. The same configuration works properly on a local machine with 2x NVIDIA GTX 1080 and the same Keras/TensorFlow backend.

According to the SageMaker documentation and tutorials, the multi_gpu_model utility works fine when the Keras backend is MXNet, but I did not find any mention of the case where the backend is TensorFlow with the same multi-GPU configuration.

[UPDATE]

I have updated the code following the suggested answer below, and I'm adding some logging captured before the TrainingJob hangs.

This logging repeats twice:

2018-11-27 10:02:49.878414: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1511] Adding visible gpu devices: 0, 1, 2, 3
2018-11-27 10:02:49.878462: I tensorflow/core/common_runtime/gpu/gpu_device.cc:982] Device interconnect StreamExecutor with strength 1 edge matrix:
2018-11-27 10:02:49.878471: I tensorflow/core/common_runtime/gpu/gpu_device.cc:988] 0 1 2 3
2018-11-27 10:02:49.878477: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 0: N Y Y Y
2018-11-27 10:02:49.878481: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 1: Y N Y Y
2018-11-27 10:02:49.878486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 2: Y Y N Y
2018-11-27 10:02:49.878492: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1001] 3: Y Y Y N
2018-11-27 10:02:49.879340: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:0 with 14874 MB memory) -> physical GPU (device: 0, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1b.0, compute capability: 7.0)
2018-11-27 10:02:49.879486: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:1 with 14874 MB memory) -> physical GPU (device: 1, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1c.0, compute capability: 7.0)
2018-11-27 10:02:49.879694: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:2 with 14874 MB memory) -> physical GPU (device: 2, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1d.0, compute capability: 7.0)
2018-11-27 10:02:49.879872: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1115] Created TensorFlow device (/device:GPU:3 with 14874 MB memory) -> physical GPU (device: 3, name: Tesla V100-SXM2-16GB, pci bus id: 0000:00:1e.0, compute capability: 7.0)

Before that, there is some logging info about each GPU, repeated 4 times:

2018-11-27 10:02:46.447639: I tensorflow/core/common_runtime/gpu/gpu_device.cc:1432] Found device 3 with properties:
name: Tesla V100-SXM2-16GB major: 7 minor: 0 memoryClockRate(GHz): 1.53
pciBusID: 0000:00:1e.0
totalMemory: 15.78GiB freeMemory: 15.37GiB

According to the logging, all 4 GPUs are visible and loaded by the TensorFlow Keras backend. After that no application logging follows; the TrainingJob status stays InProgress for a while and then becomes Failed with the same Algorithm Error.

Looking at the CloudWatch metrics I can see some activity. Specifically, GPU Memory Utilization and CPU Utilization look OK, while GPU Utilization is 0%.


[UPDATE]

Due to a known Keras bug about saving a multi-GPU model, I'm using this override of the multi_gpu_model utility from keras.utils:

from keras.layers import Lambda, concatenate
from keras import Model
import tensorflow as tf
    
def multi_gpu_model(model, gpus):
  # source: https://github.com/keras-team/keras/issues/8123#issuecomment-354857044
  if isinstance(gpus, (list, tuple)):
    num_gpus = len(gpus)
    target_gpu_ids = gpus
  else:
    num_gpus = gpus
    target_gpu_ids = range(num_gpus)

  def get_slice(data, i, parts):
    shape = tf.shape(data)
    batch_size = shape[:1]
    input_shape = shape[1:]
    step = batch_size // parts
    if i == num_gpus - 1:
      size = batch_size - step * i
    else:
      size = step
    size = tf.concat([size, input_shape], axis=0)
    stride = tf.concat([step, input_shape * 0], axis=0)
    start = stride * i
    return tf.slice(data, start, size)

  all_outputs = []
  for i in range(len(model.outputs)):
    all_outputs.append([])

  # Place a copy of the model on each GPU,
  # each getting a slice of the inputs.
  for i, gpu_id in enumerate(target_gpu_ids):
    with tf.device('/gpu:%d' % gpu_id):
      with tf.name_scope('replica_%d' % gpu_id):
        inputs = []
        # Retrieve a slice of the input.
        for x in model.inputs:
          input_shape = tuple(x.get_shape().as_list())[1:]
          slice_i = Lambda(get_slice,
                           output_shape=input_shape,
                           arguments={'i': i,
                                      'parts': num_gpus})(x)
          inputs.append(slice_i)

        # Apply model on slice
        # (creating a model replica on the target device).
        outputs = model(inputs)
        if not isinstance(outputs, list):
          outputs = [outputs]

        # Save the outputs for merging back together later.
        for o in range(len(outputs)):
          all_outputs[o].append(outputs[o])

  # Merge outputs on CPU.
  with tf.device('/cpu:0'):
    merged = []
    for name, outputs in zip(model.output_names, all_outputs):
      merged.append(concatenate(outputs,
                                axis=0, name=name))
    return Model(model.inputs, merged)

This works OK locally on 2x NVIDIA GTX 1080 / Intel Xeon / Ubuntu 16.04. It fails in the SageMaker TrainingJob.
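
For reference, here is a minimal self-contained sketch of how I wire the override above into the training code. The tiny Dense model and random data are placeholders just to exercise the code path, and gpus=4 assumes the 4 GPUs of the ml.p3.8xlarge instance:

import numpy as np
from keras.models import Sequential
from keras.layers import Dense

# tiny stand-in model and random data, only to exercise the override above
model = Sequential()
model.add(Dense(64, activation='relu', input_shape=(100,)))
model.add(Dense(10, activation='softmax'))
x = np.random.random((1024, 100))
y = np.random.random((1024, 10))

parallel_model = multi_gpu_model(model, gpus=4)  # the override defined above
parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
parallel_model.fit(x, y, epochs=1, batch_size=256)

# weights are shared with the template model, so saving the template model
# side-steps the multi-GPU save bug mentioned above
model.save_weights('weights.h5')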

I have also posted this issue on the AWS SageMaker forum.

[UPDATE]

I have slightly modified the tf.Session code, adding some initializers:

with tf.Session() as session:
    K.set_session(session)  # register this session as the Keras backend session
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())

and now at least I can see from the instance metrics that one GPU (I assume device gpu:0) is used. Multi-GPU still does not work, though.
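
For context, a condensed sketch of how this session setup sits in the training entry point; build_model, x_train and y_train are placeholders for the actual CNN and data, and the allow_growth option comes from the answer below:

import tensorflow as tf
import keras.backend as K

config = tf.ConfigProto()
config.gpu_options.allow_growth = True  # allocate GPU memory on demand

with tf.Session(config=config) as session:
    K.set_session(session)
    session.run(tf.global_variables_initializer())
    session.run(tf.tables_initializer())

    # model building, multi_gpu_model(...) and fit(...) all happen inside this block
    model = build_model()                            # placeholder for the actual CNN
    parallel_model = multi_gpu_model(model, gpus=4)  # override from the update above
    parallel_model.compile(loss='categorical_crossentropy', optimizer='rmsprop')
    parallel_model.fit(x_train, y_train, epochs=20, batch_size=256)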

loretoparisi

2 Answers


This might not be the best answer for your problem, but this is what I am using for a multi-GPU model with the TensorFlow backend. First I initialize using:

def setup_multi_gpus():
    """
    Setup multi GPU usage

    Example usage:
    model = Sequential()
    ...
    multi_model = multi_gpu_model(model, gpus=num_gpu)
    multi_model.fit()

    About memory usage:
    https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory
    """
    import tensorflow as tf
    from keras.utils.training_utils import multi_gpu_model
    from tensorflow.python.client import device_lib

    # IMPORTANT: Tells tf to not occupy a specific amount of memory
    from keras.backend.tensorflow_backend import set_session  
    config = tf.ConfigProto()  
    config.gpu_options.allow_growth = True  # dynamically grow the memory used on the GPU  
    sess = tf.Session(config=config)  
    set_session(sess)  # set this TensorFlow session as the default session for Keras.


    # getting the number of GPUs 
    def get_available_gpus():
        local_device_protos = device_lib.list_local_devices()
        return [x.name for x in local_device_protos if x.device_type == 'GPU']

    num_gpu = len(get_available_gpus())
    print('Amount of GPUs available: %s' % num_gpu)

    return num_gpu

Then I call

# Setup multi GPU usage
num_gpu = setup_multi_gpus()

and create a model.

...

After that, you're able to make it a multi-GPU model.

multi_model = multi_gpu_model(model, gpus=num_gpu)
multi_model.compile...
multi_model.fit...

The only thing here that is different from what you are doing is the way TensorFlow initializes the GPUs. I can't imagine it being the problem, but it might be worth trying out.

Good luck!

Edit: I noticed that sequence-to-sequence models are not able to work with multi-GPU. Is that the type of model you are trying to train?

deKeijzer
  • Thank you. The memory issue could be a good point. The model is a `Sequential` model, so it matches your case. Please check the logging above; it seems it still hangs, even though it reads the GPU devices correctly. – loretoparisi Nov 27 '18 at 10:14
  • I have updated the question with the code that solves a related bug in multi-GPU model saving. It does not fix the main issue, though. – loretoparisi Nov 30 '18 at 10:02
  • Going to accept this answer since no other solution was proposed and this one is the closest to a solution! Thank you. – loretoparisi Jan 08 '19 at 13:25

I apologize for the slow response.

It seems there are a lot of threads running in parallel, and I want to link them together so that other people who have the same issue can follow the progress and discussion.

https://forums.aws.amazon.com/thread.jspa?messageID=881541
https://forums.aws.amazon.com/thread.jspa?messageID=881540

https://github.com/aws/sagemaker-python-sdk/issues/512

There are a few questions in regard to this.

What versions of TensorFlow and Keras are you using?

I am not too sure what is causing this problem. Does your container have all of the needed dependencies, such as CUDA? https://www.tensorflow.org/install/gpu
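
As a quick sanity check, something along these lines can be run inside the container to confirm that the TensorFlow build sees CUDA and the GPUs:

# quick check of the TensorFlow / CUDA / GPU setup inside the training container
import tensorflow as tf
from tensorflow.python.client import device_lib

print('TensorFlow version:', tf.VERSION)
print('Built with CUDA:', tf.test.is_built_with_cuda())
print('GPU available:', tf.test.is_gpu_available())
print('GPUs:', [d.name for d in device_lib.list_local_devices() if d.device_type == 'GPU'])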

Were you able to train using a single GPU with Keras?

ByungWook
  • Thank you. I have discussed this in the sagemaker-python-sdk issue; we are using `Keras 2.2.0` and `Tensorflow 1.12.0`. With slightly updated `tf.Session` settings a single GPU works (see update above). – loretoparisi Dec 13 '18 at 14:30
  • Awesome! Thanks for that information. Would it be okay if we keep the discussion in a single thread for now? Where would you prefer to continue this conversation? – ByungWook Dec 13 '18 at 19:31
  • I think the best place is SF; the whole community will get more valuable help here than in any other place. – loretoparisi Dec 13 '18 at 19:34
  • Agreed. Which instance type are you using? How many instances are you using? Would it be possible to see the Python code that invokes SageMaker on your behalf? Were there any notebook examples or code you are following for attempting to do multi-gpu with keras on SageMaker? – ByungWook Dec 13 '18 at 19:58
  • Hello, the instance types were ml.p3.8xlarge and ml.p3.2xlarge, so a 4-GPU or a 2-GPU instance. It's not so easy to produce a working example; it's basically a variation of this CNN https://github.com/keunwoochoi/music-auto_tagging-keras/tree/master/compact_cnn (a rough sketch of how we launch the job is shown after these comments). – loretoparisi Dec 18 '18 at 18:01
  • Are you doing multi-instance multi-GPU? When you train outside of SageMaker, can you explain your environment? I understand that you are using a custom TensorFlow image, but are you using CUDA? I believe your code looks correct; this might be more of an interface issue? – ByungWook Dec 18 '18 at 20:32
  • Yes. We are using a Docker image built from `tensorflow-gpu:latest`, which means we use CUDA 9 of course. We do not do distributed training on GPU, so it is single-instance/multi-GPU. The `keras` code can see the GPU CUDA bindings (we can see this from the SageMaker logs). We can run a single GPU with the latest changes (see update above), but with the `setup_multi_gpus` fixes (Keras issue) it works on-prem and not on SageMaker, basically. – loretoparisi Dec 19 '18 at 11:51
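
For reference, a rough sketch of how a job like this is typically launched with the SageMaker Python SDK (v1); the image URI, role and S3 paths are placeholders, not our actual values:

import sagemaker
from sagemaker.estimator import Estimator

estimator = Estimator(
    image_name='<account>.dkr.ecr.<region>.amazonaws.com/keras-tf-gpu:latest',  # placeholder ECR image
    role='arn:aws:iam::<account>:role/SageMakerRole',                           # placeholder IAM role
    train_instance_count=1,               # single instance, multi-GPU
    train_instance_type='ml.p3.8xlarge',  # 4x Tesla V100
    output_path='s3://<bucket>/output',
    sagemaker_session=sagemaker.Session())

estimator.fit('s3://<bucket>/training-data')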