
As far as I understand, the number of CUDA cores of an NVIDIA GPU determines how fast it can run a single deep learning model.

So, if I'm running inference on a model in 0.5 seconds with an NVIDIA TITAN RTX GPU, which has 72 streaming multiprocessors and 4608 CUDA cores, and the reported GPU utilization peaks at ~10%, can I assume that roughly 10% of the streaming multiprocessors (about 7) are being used, and therefore roughly 450 CUDA cores (7 × 64 per SM)? (I'm basing this on this answer: https://superuser.com/questions/1109695/how-to-determine-number-of-gpu-cores-being-utilized-for-a-process)

As a result, if I downgrade to a lesser GPU with 3000 CUDA cores, it should theoretically still be able to perform inference in the same 0.5 seconds, right?

cmed123

1 Answer


That is not a correct interpretation of utilization. 10% utilization means, roughly speaking, that a GPU kernel is running 10% of the time; the other 90% of the time, no GPU kernel is running. It does not tell you anything about what that kernel is doing or how many resources it is using. The answer given on superuser is wrong. The correct description is here. As indicated there, it is possible to see 100% utilization from a GPU kernel that only uses one "core" (i.e., a kernel that is running only one thread).
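As a concrete illustration (my own sketch, not taken from the linked description; it assumes a CUDA-capable GPU and nvcc, and the spin count would need tuning for your clock rate), a kernel launched with one block and one thread keeps the GPU "busy" for as long as it runs, so nvidia-smi reports near-100% utilization even though only a single thread is executing:

```cuda
// Sketch: a single-thread kernel that busy-waits. While it runs, nvidia-smi
// shows near-100% utilization, although only one thread (one "core") is active.
// Note: on a display GPU the run-time watchdog may kill a long-running kernel.
#include <cstdio>

__global__ void busy_single_thread(long long spin_cycles)
{
    long long start = clock64();               // device-side cycle counter
    while (clock64() - start < spin_cycles) {  // busy-wait to keep a kernel resident
        // spin
    }
}

int main()
{
    busy_single_thread<<<1, 1>>>(2000000000LL); // roughly 1-2 seconds at typical clocks
    cudaDeviceSynchronize();                    // watch nvidia-smi while this runs
    printf("kernel finished: %s\n", cudaGetErrorString(cudaGetLastError()));
    return 0;
}
```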

Regarding your question, you should not assume that there will be no change in performance whatsoever if you switch from a GPU with 4608 cores to a GPU with 3000 cores. First, core count alone is not enough information to judge performance (clock speed and other factors matter). Second, if, for example, the two GPUs are of the same architectural generation, the GPU with 3000 cores is likely to be somewhat slower than the GPU with 4608 cores, because within a given generation other things such as clock speed and memory bandwidth are usually lower on the smaller GPU as well.
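If it helps, here is a rough sketch of how you could read those other factors off each card with the runtime API (the bandwidth figure is the usual back-of-the-envelope estimate from memory clock and bus width, not a measured number, and some of these struct fields are deprecated in the newest CUDA releases):

```cuda
// Sketch: print per-device properties relevant to this comparison.
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    int count = 0;
    cudaGetDeviceCount(&count);
    for (int d = 0; d < count; ++d) {
        cudaDeviceProp p;
        cudaGetDeviceProperties(&p, d);
        // DDR: 2 transfers/clock; memoryClockRate is in kHz, memoryBusWidth in bits.
        double peak_bw_gbs = 2.0 * p.memoryClockRate * (p.memoryBusWidth / 8.0) / 1.0e6;
        printf("Device %d: %s\n", d, p.name);
        printf("  SMs: %d  core clock: %.0f MHz  approx. peak mem bandwidth: %.0f GB/s\n",
               p.multiProcessorCount, p.clockRate / 1000.0, peak_bw_gbs);
    }
    return 0;
}
```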

In short, I wouldn't assume the inference performance would be the same. It depends on other things besides what you have indicated here. It could be faster, and it could also be slower, depending on the actual GPUs being compared.

With respect to CUDA GPUs that are currently available, almost anything is likely to be somewhat slower than a Titan RTX for inference. The difference may be small, perhaps negligible, or larger, depending on the specific GPU.
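Ultimately, the way to find out is to time your actual workload on each candidate GPU. A minimal sketch of timing with CUDA events (the kernel here is only a placeholder; substitute your real inference step):

```cuda
// Sketch: time a placeholder kernel with CUDA events.
#include <cstdio>
#include <cuda_runtime.h>

__global__ void placeholder_workload(float *x, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * 2.0f + 1.0f;  // stand-in arithmetic
}

int main()
{
    const int n = 1 << 24;
    float *d = nullptr;
    cudaMalloc(&d, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    placeholder_workload<<<(n + 255) / 256, 256>>>(d, n);
    cudaEventRecord(stop);
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("elapsed: %.3f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d);
    return 0;
}
```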

Robert Crovella
  • Thank you very much for your insightful response! The GPU I'm comparing the Titan RTX with is the Quadro RTX 5000. I notice that the Quadro RTX 5000 has a slightly slower core clock speed (1350 MHz vs. 1620 MHz) and lower memory bandwidth (448 GB/sec vs. 672 GB/sec). Would you say those are the two most important factors for inference performance? If so, do you think the inference performance would be significantly worse? – cmed123 Jan 17 '20 at 20:09
  • I think it will be slower/worse. I really can't say exactly how much. It's quite possible the difference would be negligible. – Robert Crovella Jan 17 '20 at 20:32