
I am running a deep learning CNN model (4 CNN layers and 3 FNN layers), written in Keras with TensorFlow as the backend, on two different machines; a rough sketch of a comparable model is included below the details.

I have two machines (A: with a GTX 960 graphics GPU, 2 GB memory, 1.17 GHz clock speed; and B: with a Tesla K40 compute GPU, 12 GB memory, 745 MHz clock speed). But when I run the CNN model on A:

Epoch 1/35 50000/50000 [==============================] - 10s 198us/step - loss: 0.0851 - acc: 0.2323

on B:

Epoch 1/35 50000/50000 [==============================] - 43s 850us/step - loss: 0.0800 - acc: 0.3110

The numbers are not even comparable. I am quite new to deep learning and to running code on GPUs. Could someone please help me understand why the numbers are so different?

  • Dataset: CIFAR-10 (32x32 RGB images)
  • Model batch size: 128
  • Model number of parameters: 1.2M
  • OS: Ubuntu 16.04
  • Nvidia driver version: 384.111
  • CUDA version: 7.5 (V7.5.17)
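
For reference, here is a trimmed-down sketch of the kind of model I mean. The layer widths, optimizer, and loss below are placeholders chosen only to land near the ~1.2M parameters, not my exact ones:

```python
# Rough sketch of a comparable 4-conv / 3-dense Keras model on CIFAR-10.
# Layer widths, optimizer, and loss are illustrative placeholders, not the actual model.
from keras.datasets import cifar10
from keras.models import Sequential
from keras.layers import Conv2D, MaxPooling2D, Flatten, Dense
from keras.utils import to_categorical

(x_train, y_train), _ = cifar10.load_data()
x_train = x_train.astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

model = Sequential([
    Conv2D(32, (3, 3), activation='relu', padding='same', input_shape=(32, 32, 3)),
    Conv2D(32, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    Conv2D(64, (3, 3), activation='relu', padding='same'),
    MaxPooling2D((2, 2)),
    Flatten(),
    Dense(256, activation='relu'),
    Dense(128, activation='relu'),
    Dense(10, activation='softmax'),
])
model.compile(optimizer='adam', loss='categorical_crossentropy', metrics=['accuracy'])
model.fit(x_train, y_train, batch_size=128, epochs=35)
```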

Please let me know if you need any more data.

Edit 1: (adding CPU info)

  • Machine A (GTX 960): 8 cores - Intel(R) Core(TM) i7-6700 CPU @ 3.40GHz
  • Machine B (Tesla K40c): 8 cores - Intel(R) Xeon(R) CPU E5-2637 v4 @ 3.50GHz
  • What is 50000? The number of kernel launches? Could the cause of the lag be kernel launch overhead due to the CUDA version or hardware? What if data bandwidth is limited by PCIe? – huseyin tugrul buyukisik Feb 02 '18 at 19:21
  • 2
    Are you actually running a K40 on a local machine? Because if you are running a cloud instance, they might be throttling the speed. I have seen various people complain that cloud instances tend to be much slower than what you get when running something locally. – bremen_matt Feb 02 '18 at 19:26
  • Loss is calculated on the CPU. Can you please add the CPU details for machine A and machine B? It might explain everything. – Manngo Feb 03 '18 at 00:20
  • I have added the CPU info in the question. Both machines are local machines. How do I check for throttling on my machines? @bremen_matt – SUB Feb 03 '18 at 18:14
  • @huseyintugrulbuyukisik how do I check if the bandwidth is limited by the pci-e? 50,000 is the number of data samples (50000 images in this case). – SUB Feb 03 '18 at 18:16
  • I wouldn't even be concerned with PCIe at the moment. Something isn't adding up here. Make sure that the code is actually running on your GPU in both instances. Just as a sanity check, you can try running on both of your CPUs as well. Obviously, if you don't see a difference between the CPU and GPU times, then there is a problem. – bremen_matt Feb 03 '18 at 19:49
  • Also, one last thing... Make sure that you let this run for a bunch of iterations. You want to make sure that initialization times are not included in the timing. You may want to wait until the 2nd epoch finishes and compare those timings. – bremen_matt Feb 03 '18 at 19:51
  • Hi @bremen_matt, I did run on the CPU and got a running time of 5 minutes for the first 10 epochs. I also ran the CNN on the GPU for 35 epochs and the time only went down by a few seconds for subsequent epochs. Machine A: 9 seconds and Machine B: 38-39 seconds. – SUB Feb 07 '18 at 17:39
  • I can only think of two more possibilities... – bremen_matt Feb 07 '18 at 17:57
  • 1. The cuDNN or TensorFlow version you have is better optimized for the GTX. You could try upgrading TensorFlow and CUDA, and see if that impacts the performance of either. – bremen_matt Feb 07 '18 at 17:59
  • 2. GPUs are really good at handling floats, but generally bad at handling doubles. I would expect the K40 to be much better than the GTX, but something is strange here. You might try changing the data types of your variables to floats and see if they give similar performance then. – bremen_matt Feb 07 '18 at 18:10
  • I had a similar problem: the better GPU (GTX 1080 Ti) was slower than a Quadro K1200. I cannot explain why, but the following config options helped my case: `config.gpu_options.per_process_gpu_memory_fraction` or `config.gpu_options.allow_growth = True` (https://stackoverflow.com/questions/48236274/why-is-geforce-gtx-1080-ti-slower-than-quadro-k1200-on-training-a-rnn-model); a sketch of these session options, together with a device-placement check, follows this comment thread. – Maosi Chen Feb 07 '18 at 22:19
  • @bremen_matt I have changed the data type to float32 but it didn't help. However, I will try upgrading/re-installing cuDNN. – SUB Feb 11 '18 at 01:13
  • @MaosiChen those config options do help speed things up (on both machines). I get a time of A: 9 sec and B: 28 sec. However, there is still a difference in performance. – SUB Feb 11 '18 at 01:14
  • You may try to profile your runs with tensorflow profiler (https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md) to see which name scope costs most time. – Maosi Chen Feb 11 '18 at 03:08
  • I just noticed that your Xeon processor doesn't have integrated graphics. That means that the GPU in setup B has to handle both rendering and your TensorFlow code. In setup A, the rendering might be happening on the CPU. That might explain a lot. You should try dropping your screen resolution as much as possible when running the tests and closing any other programs. It could be that GPU B has a much higher workload due to this. – bremen_matt Feb 11 '18 at 04:10
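
For reference, the two suggestions from the comments above (checking that TensorFlow actually sees and uses the GPU, and the `allow_growth` session option) look roughly like this with TensorFlow 1.x and the Keras TF backend; this is a minimal illustrative sketch, not the asker's code:

```python
# Minimal sketch (TensorFlow 1.x + Keras with the TF backend) of the checks
# discussed in the comments; illustrative only.
import tensorflow as tf
from keras import backend as K
from tensorflow.python.client import device_lib

# 1. Confirm that a GPU is actually visible to TensorFlow on each machine.
print(device_lib.list_local_devices())

# 2. Log which device each op is placed on, and let GPU memory grow on demand
#    (the option mentioned by Maosi Chen) instead of being grabbed all at once.
config = tf.ConfigProto(log_device_placement=True)
config.gpu_options.allow_growth = True
K.set_session(tf.Session(config=config))
```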

1 Answer


TL;DR: Measure again with a larger batch size.

Those results do not surprise me much. It's a common mistake to think that an expensive Tesla card (or any GPU, for that matter) will automatically do everything faster. You have to understand how GPUs work in order to harness their power.

If you compare the base clock speeds of your devices, you will find that your Xeon CPU has the fastest one:

  • Nvidia K40c: 745MHz
  • Nvidia GTX 960: 1127MHz
  • Intel i7: 3400MHz
  • Intel Xeon: 3500MHz

This gives you a hint of the speeds at which these devices operate and a very rough estimate of how fast they can crunch numbers if they were only doing one thing at a time, that is, with no parallelization.

So as you see, GPUs are not fast at all (for some definition of fast); in fact, they're quite slow. Also note how the K40c is actually slower than the GTX 960. However, the real power of a GPU comes from its ability to process a lot of data simultaneously! If you now look at how much parallelization is possible on these devices, you will find that your K40c is not so bad after all:

  • Nvidia K40c: 2880 CUDA cores
  • Nvidia GTX 960: 1024 CUDA cores
  • Intel i7: 8 threads
  • Intel Xeon: 8 threads

Again, these numbers give you a very rough estimate of how many things these devices can do simultaneously.
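
As a very rough back-of-the-envelope multiplication (with all the caveats below): 2880 × 745 MHz ≈ 2150 "core-GHz" for the K40c versus 1024 × 1127 MHz ≈ 1150 for the GTX 960, i.e. nearly twice the raw parallel throughput, but only if you manage to keep all those cores busy.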

Note: I am severely simplifying things here: in absolutely no way is a CPU core comparable to a CUDA core! They are very, very different things. And in no way can base clock frequencies be compared like this! It's just to give an idea of what's happening.

So, your devices need to be able to process a lot of data in parallel in order to maximize their throughput. Luckily, TensorFlow already does this for you: it will automatically parallelize all those heavy matrix multiplications for maximum throughput. However, this is only fast if the matrices have a certain size. Your batch size is set to 128, which means that almost all of these matrices will have 128 as their first dimension. I don't know the details of your model, but if the other dimensions are not large either, then I suspect that most of the K40c is sitting idle during those matrix multiplications.

Try increasing the batch size and measure again. You should find that larger batch sizes make the K40c faster in comparison with the GTX 960. The same should be true for increasing the model's capacity: increase the number of units in the fully-connected layers and the number of filters in the convolutional layers. Adding more layers will probably not help here. The output of the nvidia-smi tool is also very useful to see how busy a GPU really is.
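
If you want to measure this directly, a quick sweep along the following lines will show how the per-epoch time scales with batch size on each machine (a rough sketch; `build_model` is a placeholder for however you construct and compile your own model):

```python
# Rough sketch of a batch-size sweep; build_model() is a placeholder for
# your own model-construction and compilation code.
import time

for batch_size in (128, 256, 512, 1024):
    model = build_model()  # placeholder: the 4-conv / 3-dense model
    # Warm-up epoch so one-off initialization is not included in the timing.
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    start = time.time()
    model.fit(x_train, y_train, batch_size=batch_size, epochs=1, verbose=0)
    print('batch_size=%d: %.1f s/epoch' % (batch_size, time.time() - start))
```

Watching nvidia-smi in a second terminal while this runs will show you how close each GPU gets to full utilization at each batch size.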

Note, however, that changing the model's hyper-parameters and/or the batch size will of course have a huge impact on how well the model is able to train, and naturally you might also hit memory limitations.

Perhaps, if increasing the batch size or changing the model is not an option, you could also try training two models on the K40c at the same time to make use of the idle cores. However, I have never tried this, so it might not work at all.
