I've been playing around with creating convolutional neural networks using Keras. I've gotten some decent results, but training on my laptop can take hours, so I figured I could speed things up using a GPU instance at AWS. I spun up a g4dn.2xlarge and assumed I would see training fly through. Instead, I see steps running about as slowly as they did on my laptop.
I used the following tutorial to set up my instance and start the Jupyter notebook server: https://aws.amazon.com/getting-started/hands-on/get-started-dlami/
After opening Jupyter in my browser, I selected the conda_tensorflow2_p36 kernel. After importing the tensorflow/keras libraries, I have a line like this:
print("Num GPUs Available: ", len(tf.config.experimental.list_physical_devices('GPU')))
It confirms that there is 1 GPU available. I don't think what I have going is particularly big: my training set is 42,950 images belonging to 859 classes, and my validation set is 21,463 images belonging to the same 859 classes. So more or less 50 images per class, which I know is not a lot, but I'm getting accurate enough results on my laptop, so I'm not worried about that. Batch size is 25 for the training set and 10 for the validation set. Independent of how the convolutional neural network is architected, I would assume it would run faster on the EC2 instance than on my crappy laptop. What could I be doing wrong here?
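One check I know of to see where ops actually execute (as opposed to which devices are merely visible) is TensorFlow's per-op device placement logging. A minimal sketch, assuming TF 2.x eager mode; the matmul here is just a stand-in for a training op:

```python
import tensorflow as tf

# Log to stderr which device (CPU:0 or GPU:0) each op is placed on.
tf.debugging.set_log_device_placement(True)

# A tiny op to trigger a placement log line; during real training,
# every op in the model would be logged the same way.
a = tf.constant([[1.0, 2.0], [3.0, 4.0]])
b = tf.constant([[1.0, 1.0], [1.0, 1.0]])
c = tf.matmul(a, b)
print(c)
```

If the log shows ops landing on `/device:GPU:0`, the GPU is genuinely in use and the slowdown is more likely elsewhere (e.g. the input pipeline).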
Update 1
I'm running the notebook on an EBS volume and the data is there as well; could that be a problem?
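To sanity-check whether reading the images off the volume could be the bottleneck, a rough stdlib-only timing sketch like this could measure raw file-read throughput (the file size and count are made-up stand-ins for my JPEGs, and since the files are freshly written they'll likely be served from the page cache, so this is a best-case number):

```python
import os
import tempfile
import time

FILE_SIZE = 200 * 1024   # ~200 KB, roughly a typical JPEG
NUM_FILES = 50

with tempfile.TemporaryDirectory() as d:
    # Write dummy files standing in for training images.
    paths = []
    for i in range(NUM_FILES):
        p = os.path.join(d, f"img_{i}.bin")
        with open(p, "wb") as f:
            f.write(os.urandom(FILE_SIZE))
        paths.append(p)

    # Time reading them all back.
    start = time.perf_counter()
    total = 0
    for p in paths:
        with open(p, "rb") as f:
            total += len(f.read())
    elapsed = time.perf_counter() - start

print(f"Read {total / 1e6:.1f} MB in {elapsed:.3f}s "
      f"({total / 1e6 / elapsed:.1f} MB/s)")
```

If the throughput here is far above what a training step consumes, disk reads alone probably aren't the problem.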
Update 2
I've added the following lines to test (with the import that device_lib needs):

from tensorflow.python.client import device_lib
import keras.backend.tensorflow_backend as tfback

print(device_lib.list_local_devices())
tfback._get_available_gpus()
Output is this (it seems to say the GPU is visible):
[name: "/device:CPU:0"
device_type: "CPU"
memory_limit: 268435456
locality {
}
incarnation: 15554497242449630399
, name: "/device:XLA_CPU:0"
device_type: "XLA_CPU"
memory_limit: 17179869184
locality {
}
incarnation: 10944104228818506418
physical_device_desc: "device: XLA_CPU device"
, name: "/device:XLA_GPU:0"
device_type: "XLA_GPU"
memory_limit: 17179869184
locality {
}
incarnation: 6119898251220801089
physical_device_desc: "device: XLA_GPU device"
, name: "/device:GPU:0"
device_type: "GPU"
memory_limit: 14949928141
locality {
bus_id: 1
links {
}
}
incarnation: 6277213019374881773
physical_device_desc: "device: 0, name: Tesla T4, pci bus id: 0000:00:1e.0, compute capability: 7.5"
]
['/device:GPU:0']