
Problem type: regression

Inputs: sequence length varies from 14 to 39, each sequence point is a 4-element vector.

Output: a scalar

Neural Network: 3-layer Bi-LSTM (hidden vector size: 200) followed by 2 Fully Connected layers

Batch Size: 30

Number of samples per epoch: ~7,000

TensorFlow version: tf-nightly-gpu 1.6.0-dev20180112

CUDA version: 9.0

CuDNN version: 7
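
For concreteness, here is a rough sketch of a network with this shape (this is not the actual training code, which is over 1,000 lines; the stack_bidirectional_dynamic_rnn helper, placeholder names, FC width, and optimizer are assumptions made only for illustration):

```python
import tensorflow as tf

# Illustrative sketch only: 3-layer Bi-LSTM (hidden size 200) + 2 FC layers
# producing a scalar, for variable-length sequences of 4-element vectors.
inputs = tf.placeholder(tf.float32, [None, None, 4])   # [batch, time, 4]
seq_len = tf.placeholder(tf.int32, [None])              # true lengths (14-39)
targets = tf.placeholder(tf.float32, [None])            # scalar labels

cells_fw = [tf.nn.rnn_cell.LSTMCell(200) for _ in range(3)]
cells_bw = [tf.nn.rnn_cell.LSTMCell(200) for _ in range(3)]
_, states_fw, states_bw = tf.contrib.rnn.stack_bidirectional_dynamic_rnn(
    cells_fw, cells_bw, inputs, sequence_length=seq_len, dtype=tf.float32)

# Concatenate the final hidden states of the top forward/backward cells.
features = tf.concat([states_fw[-1].h, states_bw[-1].h], axis=1)  # [batch, 400]
fc1 = tf.layers.dense(features, 100, activation=tf.nn.relu)
prediction = tf.squeeze(tf.layers.dense(fc1, 1), axis=1)          # [batch]

loss = tf.losses.mean_squared_error(targets, prediction)
train_op = tf.train.AdamOptimizer(1e-3).minimize(loss)
```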

Details of the two GPUs:

GPU 0: name: GeForce GTX 1080 Ti major: 6 minor: 1 memoryClockRate(GHz): 1.582 totalMemory: 11.00GiB freeMemory: 10.72GiB

Device placement log (1080 Ti run): device_placement_log_0.txt

nvidia-smi during the run (using 1080 Ti only):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69                 Driver Version: 385.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:02:00.0 Off |                  N/A |
| 20%   37C    P2    58W / 250W |  10750MiB / 11264MiB |     10%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K1200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   35C    P8     1W /  31W |    751MiB /  4096MiB |      2%      Default |
+-------------------------------+----------------------+----------------------+

GPU 1: name: Quadro K1200 major: 5 minor: 0 memoryClockRate(GHz): 1.0325 totalMemory: 4.00GiB freeMemory: 3.44GiB

Device placement log (K1200 run): device_placement_log_1.txt

nvidia-smi during the run (using K1200 only):

+-----------------------------------------------------------------------------+
| NVIDIA-SMI 385.69                 Driver Version: 385.69                    |
|-------------------------------+----------------------+----------------------+
| GPU  Name            TCC/WDDM | Bus-Id        Disp.A | Volatile Uncorr. ECC |
| Fan  Temp  Perf  Pwr:Usage/Cap|         Memory-Usage | GPU-Util  Compute M. |
|===============================+======================+======================|
|   0  GeForce GTX 108... WDDM  | 00000000:02:00.0 Off |                  N/A |
| 20%   29C    P8     8W / 250W |    136MiB / 11264MiB |      0%      Default |
+-------------------------------+----------------------+----------------------+
|   1  Quadro K1200       WDDM  | 00000000:03:00.0  On |                  N/A |
| 39%   42C    P0     6W /  31W |   3689MiB /  4096MiB |     23%      Default |
+-------------------------------+----------------------+----------------------+

Time spent for 1 epoch:

GPU 0 only (set environment var "CUDA_VISIBLE_DEVICES"=0): ~60 minutes

GPU 1 only (set environment var "CUDA_VISIBLE_DEVICES"=1): ~45 minutes

The environment variable TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 was set during both tests; without it, TensorFlow skips the K1200 because it has fewer multiprocessors than TensorFlow's default minimum.
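
For reference, a minimal sketch of how each single-GPU test can be set up (this is an assumption about how the variables were set; in Python they must be assigned before TensorFlow initializes the GPUs):

```python
import os

# Expose exactly one physical GPU to TensorFlow ("0" = 1080 Ti, "1" = K1200).
os.environ["CUDA_VISIBLE_DEVICES"] = "0"

# Lower the multiprocessor-count threshold so TensorFlow accepts the K1200.
os.environ["TF_MIN_GPU_MULTIPROCESSOR_COUNT"] = "4"

import tensorflow as tf  # import only after the environment is configured
```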

Why is the better GPU (GeForce GTX 1080 Ti) slower at training my neural network?

Thanks in advance.


Update

Another set of tests on the MNIST dataset using a CNN model showed the same pattern:

Time spent for training 17 epochs:

GPU 0 (1080 Ti): ~59 minutes

GPU 1 (K1200): ~45 minutes

Comments:
  • Is either of the GPUs being used for graphics as well? – Alexandre Passos Jan 16 '18 at 17:35
  • @AlexandrePassos, Yes, the Quadro K1200 was used for graphics (two monitors, resolutions: 1920x1200 and 1280x1024). The GeForce GTX 1080 Ti was not used for graphics or any activities other than training the model. – Maosi Chen Jan 16 '18 at 17:48
  • One of two options: (1) TF is deciding which GPU is 0 and which is 1 differently from nvidia (look in the tf startup logs to see what it decides), or (2) this particular model is faster on the CPU than on the GPU (tf by default won't run on the Quadro K1200 because there is not enough compute capacity on it). Can you log device placement to see? – Alexandre Passos Jan 17 '18 at 18:29
  • @AlexandrePassos, Thanks for the options. The text under my question "Details of the two GPUs" (i.e. "name: GeForce GTX 1080 Ti major: 6 minor: 1 ...") was copied from the TF screen logs, which shows that the "CUDA_VISIBLE_DEVICES" setting is working (pointing to the desired single GPU in each test). I just checked that the Quadro K1200 has a compute capability of 5.0, which is higher than the required value (3.0) for running the GPU version of TensorFlow. Also, although I didn't show it in the question, I saw the temperature of the target GPU increase from 30 to 60 degrees C during each of the two tests. – Maosi Chen Jan 17 '18 at 19:56
  • Can you [log the op device placement](https://www.tensorflow.org/tutorials/using_gpu#logging_device_placement) to confirm that when you're using the slower GPU the GPU is actually being used? – Alexandre Passos Jan 17 '18 at 21:03
  • @AlexandrePassos, thanks for the tutorial for logging device placement. I have uploaded the two log files in the question part under the two pictures. Most of the nodes/ops are placed on GPU with only a few exceptions. – Maosi Chen Jan 17 '18 at 22:03
  • Forgot to mention: in order to do these two tests, I had to set the environment variable TF_MIN_GPU_MULTIPROCESSOR_COUNT=4 for both of them. – Maosi Chen Jan 18 '18 at 18:04
  • One suggestion: post your code as well. Maybe some operation you are doing is very slow. On a side note, it sounds like you're doing regression on DNA/RNA. For your input, your model is definitely over-specified: a 3-layer, 200-node LSTM is way too much. I'd suggest starting with a single layer with 64 nodes and optimizing from there. – thc Jan 19 '18 at 00:29
  • @thc, My full code is too long (over 1,000 lines) to post here. I have posted the comparison results for training MNIST with a CNN model (https://github.com/martin-gorner/tensorflow-mnist-tutorial/blob/master/mnist_4.2_batchnorm_convolutional.py). I've added it under the "Update " at the bottom of the question. The finding is similar: the 1080 Ti still spent more time on training. – Maosi Chen Jan 19 '18 at 00:53
  • Can you show us what happens if you run nvidia-smi during the computation with both GPUs? – Alexandre Passos Jan 19 '18 at 19:10
  • @AlexandrePassos, I have added the nvidia-smi results during both runs (after creating the input buffer but only trained for less than 100 steps) in the question. Thanks. – Maosi Chen Jan 19 '18 at 21:45
  • Thanks. The memory usage seems off (why is it using more memory on the 1080 if it's not serving graphics?). I think Stack Overflow is the wrong forum for this; can you file a GitHub issue with a shorter script we can run that reproduces the problem on your machine? Thanks – Alexandre Passos Jan 23 '18 at 17:52
  • @AlexandrePassos, I think TensorFlow by default tries to allocate nearly all available GPU memory regardless of how much it actually needs (https://stackoverflow.com/questions/34199233/how-to-prevent-tensorflow-from-allocating-the-totality-of-a-gpu-memory). I have tested the two GPUs with a CNN code training MNIST, which showed the same pattern. Please find the link to the CNN code in the **Update** part of my question (at the bottom). Thanks. – Maosi Chen Jan 23 '18 at 18:00

1 Answer


The official TensorFlow documentation has a section "Allowing GPU memory growth" that introduces two session options for controlling GPU memory allocation. I tried them separately while training my RNN model (using only the GeForce GTX 1080 Ti):

  • config.gpu_options.allow_growth = True
  • config.gpu_options.per_process_gpu_memory_fraction = 0.05

Both of them shortened the training time from the original ~60 minutes per epoch to ~42 minutes per epoch. I still don't understand why this helps. If you can explain it, I will accept that as the answer. Thanks.
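
For reference, a minimal sketch of where those two options go in the TF 1.x session configuration (the model-building and training code is omitted):

```python
import tensorflow as tf

config = tf.ConfigProto()

# Option 1: allocate GPU memory on demand rather than reserving
# nearly all free memory up front.
config.gpu_options.allow_growth = True

# Option 2 (tried separately): cap the allocation at 5% of the GPU's memory.
# config.gpu_options.per_process_gpu_memory_fraction = 0.05

with tf.Session(config=config) as sess:
    sess.run(tf.global_variables_initializer())
    # ... run the training loop as before ...
```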
