
I have a 4-GPU machine on which I run TensorFlow (GPU) with Keras. Some of my classification problems take several hours to complete.

nvidia-smi reports a Volatile GPU-Util that never exceeds 25% on any of my 4 GPUs. How can I increase GPU utilization and speed up my training?

  • According to the [Performance Guide](https://www.tensorflow.org/performance/performance_guide), the input pipeline could be the bottleneck. – kww Nov 16 '17 at 00:24

2 Answers


If your GPU utilization is below 80%, this is generally the sign of an input pipeline bottleneck. It means that the GPU sits idle much of the time, waiting for the CPU to prepare the data.

What you want is for the CPU to keep preparing batches while the GPU is training, so that the GPU stays fed. This is called prefetching.

Great, but if batch preparation still takes longer than a training step, the GPU will remain idle, waiting for the CPU to finish the next batch. To make batch preparation faster, we can parallelize the different preprocessing operations.

We can go even further by parallelizing I/O.
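As a rough sketch of what parallel I/O looks like, `Dataset.interleave` can read several input files concurrently instead of one after another. The two tiny TFRecord shards written below are stand-ins for real training shards, and the code uses TF 2.x API names (`tf.io.TFRecordWriter`, `tf.data.AUTOTUNE`):

```python
import os
import tempfile

import tensorflow as tf

# Write two tiny TFRecord shards so the sketch is self-contained.
# In practice these would be your existing training shards.
tmpdir = tempfile.mkdtemp()
paths = [os.path.join(tmpdir, f"shard-{i}.tfrecord") for i in range(2)]
for p in paths:
    with tf.io.TFRecordWriter(p) as writer:
        for i in range(4):
            example = tf.train.Example(features=tf.train.Features(
                feature={"x": tf.train.Feature(
                    int64_list=tf.train.Int64List(value=[i]))}))
            writer.write(example.SerializeToString())

files = tf.data.Dataset.from_tensor_slices(paths)
# interleave() opens several files at once and mixes their records,
# so slow disks/network reads overlap instead of serializing.
dataset = files.interleave(
    tf.data.TFRecordDataset,
    cycle_length=2,                       # number of files read in parallel
    num_parallel_calls=tf.data.AUTOTUNE)  # let TF pick the parallelism

print(sum(1 for _ in dataset))  # 8 serialized records across the two shards
```

The same idea applies to any file-backed source (CSV, text lines): map a dataset of filenames through a per-file reader inside `interleave`.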

Now, to implement this in Keras, you need to use the TensorFlow Data API (`tf.data`) with TensorFlow version >= 1.9.0. Here is an example:

Let's assume, for the sake of this example, that you have two NumPy arrays x and y. You can use tf.data with any type of data, but this is simpler to understand.

def preprocessing(x, y):
    # Can only contain TF operations
    ...
    return x, y

dataset = tf.data.Dataset.from_tensor_slices((x, y))  # Creates a Dataset object
dataset = dataset.map(preprocessing, num_parallel_calls=64)  # Parallel preprocessing
dataset = dataset.batch(batch_size)
dataset = dataset.prefetch(1)  # Prepare the next batch while the GPU trains on the current one

....

model = tf.keras.Model(...)
model.fit(x=dataset)  # Since TF 1.9.0 you can pass a Dataset object directly

tf.data is very flexible, but like everything in TensorFlow (except eager execution), it uses a static graph. This can be a pain sometimes, but the speed-up is worth it.

To go further, you can have a look at the TensorFlow performance guide and the tf.data guide.

Olivier Dehaene

I had a similar issue: the memory of all the GPUs was allocated by Keras, but Volatile GPU-Util stayed around 0% and training took almost as long as on the CPU. I was using ImageDataGenerator, which turned out to be the bottleneck. When I increased the number of workers in the fit_generator method from the default value of 1 to the number of available CPUs, training time dropped dramatically.

You can also load the data into memory and then use the flow method to prepare batches of augmented images.
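Putting both suggestions together, a minimal sketch might look like the following. The array shapes and augmentation parameters are illustrative assumptions, and `fit_generator` with `workers` is shown as a comment since it needs a compiled model:

```python
import numpy as np
from tensorflow.keras.preprocessing.image import ImageDataGenerator

# Dummy in-memory images standing in for a real dataset
# (256 RGB images of 32x32 pixels, 10 classes -- illustrative shapes).
x = np.random.rand(256, 32, 32, 3).astype("float32")
y = np.random.randint(0, 10, size=(256,))

# flow() yields augmented batches from data already loaded in memory,
# avoiding per-epoch disk reads.
datagen = ImageDataGenerator(rotation_range=15, horizontal_flip=True)
batches = datagen.flow(x, y, batch_size=32)

x_batch, y_batch = next(batches)
print(x_batch.shape)  # (32, 32, 32, 3)

# With a compiled model, workers > 1 parallelizes batch preparation:
# model.fit_generator(batches, steps_per_epoch=len(x) // 32,
#                     workers=8, use_multiprocessing=True)
```

Note that ImageDataGenerator and fit_generator are legacy APIs in recent TensorFlow releases; the tf.data pipeline from the answer above is the preferred modern route.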

Konrad
  • Hi @Konrad, can you please elaborate on how you increased the number of workers in flow_from_directory? – Sharanya Arcot Desai Sep 13 '18 at 00:03
  • @SharanyaArcotDesai, by mistake I mentioned `flow_from_directory`, but the number of workers is set as a parameter of the `fit_generator` method. Sorry about this; I will also update my answer above. – Konrad Sep 13 '18 at 08:37
  • @Konrad Thank you. your fix helped me to some extent. However I dont see continuous utilization of GPU though. Currently I have set the batch size as 32, should I reduce it? – KK2491 Apr 29 '19 at 12:57
  • @KK2491 if your batch fits into GPU memory, I would leave it as it is. As [Olivier mentioned in his answer](https://stackoverflow.com/a/52273471/2585410), if the GPU is not utilized at ~80%, it is probably caused by an input bottleneck: the CPU cannot prepare the data for the GPU in time. So you would probably have to do some preprocessing optimization magic to utilize the GPU fully. – Konrad Apr 30 '19 at 19:32
  • @Konrad any suggestions on preprocessing optimization techniques? – KK2491 May 02 '19 at 14:05
  • @KK2491, unfortunately, for now I am not able to help in this topic – Konrad May 06 '19 at 06:01