5

I wrote my own CUDA kernel. Compared to the CPU code, my kernel is about 10 times faster than the CPU version.

But I have some questions about my experiments.

Is my program fully optimized: does it use all GPU cores, use shared memory properly, keep an adequate register count, and achieve enough occupancy?

How can I evaluate my kernel code's performance?

How can I calculate CUDA's theoretical maximum throughput?

Am I right that comparing the CPU's GFLOPS with the GPU's GFLOPS gives a transparent comparison of their theoretical performance?

Thanks in advance.

bongmo.kim

2 Answers

5

Is my program fully optimized: does it use all GPU cores, use shared memory properly, keep an adequate register count, and achieve enough occupancy?

To find this out, you use one of the CUDA profilers. See How Do You Profile & Optimize CUDA Kernels?

How can I calculate CUDA's theoretical maximum throughput?

That math is slightly involved, different for each architecture and easy to get wrong. Better to look the numbers up in the specs for your chip. There are tables on Wikipedia, such as this one, for the GTX 500 series cards. For instance, you can see from the table that a GTX 580 has a theoretical peak bandwidth of 192.4 GB/s and a compute throughput of 1581.1 GFLOPS.
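
If you want to compute the theoretical peak memory bandwidth programmatically instead of looking it up, here is a minimal sketch (assuming a CUDA toolkit recent enough that cudaDeviceProp exposes memoryClockRate and memoryBusWidth):

    #include <stdio.h>
    #include <cuda_runtime.h>

    int main(void)
    {
        cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, 0);   /* query device 0 */

        /* memoryClockRate is in kHz and memoryBusWidth in bits;
           the factor of 2 accounts for the double data rate. */
        double peak_bw = 2.0 * prop.memoryClockRate * (prop.memoryBusWidth / 8.0) / 1.0e6;

        printf("%s: theoretical peak bandwidth = %.1f GB/s\n", prop.name, peak_bw);
        return 0;
    }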

Am I right that comparing the CPU's GFLOPS with the GPU's GFLOPS gives a transparent comparison of their theoretical performance?

If I understand correctly, you are asking if the number of theoretical peak GFLOPs on a GPU can be directly compared with the corresponding number on a CPU. There are some things to consider when comparing these numbers:

  • Older GPUs did not support double precision (DP) floating point, only single precision (SP).

  • GPUs that do support DP do so with a significant performance degradation as compared to SP. The GFLOPs number I quoted above was for SP. On the other hand, numbers quoted for CPUs are often for DP, and there is less difference between the performance of SP and DP on a CPU.

  • CPU quotes can be for rates that are achievable only when using SIMD (single instruction, multiple data) vectorized instructions, and it is typically very hard to write algorithms that can approach the theoretical maximum (they may have to be written in assembly). Sometimes, CPU quotes are for a combination of all the computing resources available through different types of instructions, and it is often virtually impossible to write a program that can exploit them all simultaneously. (A rough worked example follows this list.)

  • The rates quoted for GPUs assume that you have enough parallel work to saturate the GPU and that your algorithm is not bandwidth bound.
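
As a rough illustration of where such peak numbers come from, the usual back-of-the-envelope formula is clock rate × FLOPs per cycle per core × number of cores. The figures below are assumptions for illustration only (a Sandy Bridge-like CPU with 8 SP lanes per AVX instruction and 2 AVX instructions per clock, and a GTX 580 with 512 CUDA cores at a 1.544 GHz shader clock and 2 FLOPs per cycle via fused multiply-add); check them against your actual chips:

    #include <stdio.h>

    int main(void)
    {
        /* CPU (Sandy Bridge-like, assumed figures): 2.8 GHz, 8 SP lanes per
           AVX instruction, 2 AVX instructions per clock, 8 cores. */
        double cpu_peak = 2.8 * 8 * 2 * 8;      /* ~358.4 SP GFLOPS */

        /* GPU (GTX 580): 1.544 GHz shader clock, 2 FLOPs per cycle per core
           (fused multiply-add), 512 CUDA cores. */
        double gpu_peak = 1.544 * 2 * 512;      /* ~1581.1 SP GFLOPS */

        printf("CPU peak: ~%.1f SP GFLOPS\n", cpu_peak);
        printf("GPU peak: ~%.1f SP GFLOPS\n", gpu_peak);
        return 0;
    }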

Roger Dahl
  • I have an extra question. Suppose CPU: 2.8 GHz, 1 core; GPU: 1.6 GHz, 384 cores (GTX 560 Ti spec). In this example, the CPU's expected performance is 2.8 GHz × 1 core = 2.8 GHz·core, and the GPU's expected performance is 0.8 GHz × 384 cores = 307.2 GHz·core. Is this calculation valid? – bongmo.kim Aug 12 '12 at 11:30
  • 3
    You can't measure performance in GHz. To find theoretical performance, you have to find out what the CPU/GPU can do in each clock cycle. And the considerations I mentioned earlier affect the performance you can get. For instance, if you're looking at the performance of an Intel Sandy Bridge, you have to take into account that it has an instruction set called AVX that can perform 8 SP operations per instruction. – Roger Dahl Aug 12 '12 at 14:45
  • Also, the Sandy Bridge can run 2 AVX operations per clock, plus an AVX load/store. So that would indicate 2.8GHz * 8 SP per AVX * 2 AVX per clock = 44.8GFLOPS per core. On a 8-core chip, that would add up to 8 * 44.8GFLOPS = 358.4GFLOPS. I haven't researched these numbers, so take them with a grain of salt. I just added these to show the danger of attempting to find performance by directly multiplying the frequency with the number of cores. – Roger Dahl Aug 12 '12 at 15:26
  • 1
    @chaohuang I don't have enough reputation, so I can't upvote this answer. The replies are pretty helpful to me. I'll accept the reply that is most helpful to me. – bongmo.kim Aug 21 '12 at 01:13
3

The preferred measure of performance is elapsed time. GFLOPs can be used as a comparison method but it is often difficult to compare between compilers and architectures due to differences in instruction set, compiler code generation, and method of counting FLOPs.

The best method is to time the performance of the application. For the CUDA code, you should time all of the work that occurs per launch. This includes memory copies and synchronization.
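
A minimal timing sketch using CUDA events is shown below; the kernel, problem size, and launch configuration are placeholders for your own code, and error checking is omitted for brevity:

    #include <stdio.h>
    #include <stdlib.h>
    #include <cuda_runtime.h>

    /* Placeholder kernel that stands in for the real one. */
    __global__ void myKernel(const float *in, float *out, int n)
    {
        int i = blockIdx.x * blockDim.x + threadIdx.x;
        if (i < n) out[i] = 2.0f * in[i];
    }

    int main(void)
    {
        const int n = 1 << 20;
        const size_t bytes = n * sizeof(float);

        float *h_in  = (float *)malloc(bytes);
        float *h_out = (float *)malloc(bytes);
        for (int i = 0; i < n; ++i) h_in[i] = 1.0f;

        float *d_in, *d_out;
        cudaMalloc(&d_in, bytes);
        cudaMalloc(&d_out, bytes);

        cudaEvent_t start, stop;
        cudaEventCreate(&start);
        cudaEventCreate(&stop);

        /* Time everything that happens per launch: the copy in,
           the kernel itself, and the copy back. */
        cudaEventRecord(start, 0);
        cudaMemcpy(d_in, h_in, bytes, cudaMemcpyHostToDevice);
        myKernel<<<(n + 255) / 256, 256>>>(d_in, d_out, n);
        cudaMemcpy(h_out, d_out, bytes, cudaMemcpyDeviceToHost);
        cudaEventRecord(stop, 0);
        cudaEventSynchronize(stop);   /* wait for all of the above to finish */

        float ms = 0.0f;
        cudaEventElapsedTime(&ms, start, stop);
        printf("elapsed time: %.3f ms\n", ms);

        cudaEventDestroy(start);
        cudaEventDestroy(stop);
        cudaFree(d_in); cudaFree(d_out);
        free(h_in); free(h_out);
        return 0;
    }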

Nsight Visual Studio Edition and the Visual Profiler provide the most accurate measurement of each operation. Nsight Visual Studio Edition provides theoretical bandwidth and FLOPs values for each device. In addition, the Achieved FLOPs experiment can be used to capture the FLOP count for both single and double precision.

Greg Smith
  • I checked the performance of the CPU and GPU using elapsed time. The difference between them is a factor of 10. But can I say that this 10× speedup is the best achievable performance? If yes, why? If no, why not? That is my question. – bongmo.kim Aug 21 '12 at 01:15