Kernel launches are asynchronous from the point of view of the CPU (see this answer). If you time your kernel the way you do, without any synchronization (i.e. without calling `cudaDeviceSynchronize()`), your timings will not mean anything, since the computation may still be in progress on the GPU.
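As a minimal sketch of correct timing, you can use CUDA events, which are recorded in the GPU's stream and therefore measure the kernel itself (the kernel `myKernel` and its launch configuration below are placeholders, not from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // Block the host until the stop event is reached on the GPU; without
    // this (or cudaDeviceSynchronize()), the measured time is meaningless.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```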
You can trust NVIDIA's profilers when it comes to timing your kernels (`nvprof`/`nvvp`). The NVIDIA Visual Profiler can also analyze your program and offer advice on what may be wrong with your kernels: uncoalesced memory accesses, an inefficient number of threads/blocks assigned, etc. You also need to compile your code in release mode with optimization flags (e.g. `-O3`) to get relevant timings.
Concerning kernel optimization, you need to find your bottlenecks (e.g. your 44% kernel), analyze them, and apply the usual optimization techniques:
- Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
- Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
- Coalesce all memory accesses
- Prefer shared memory access to global memory access
- Avoid code execution branching within a single warp as this serializes the threads
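To illustrate the last two points, here is a sketch contrasting a coalesced access pattern with a divergent one (the kernels and the arithmetic are illustrative only, not taken from your code):

```cuda
// Coalesced: consecutive threads in a warp access consecutive addresses,
// so the hardware can combine them into a few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Divergent: threads of the same warp take different branches, which the
// hardware serializes (each path executes while the other threads idle).
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) data[i] *= 2.0f;   // half the warp idles here...
    else            data[i] += 1.0f;   // ...then the other half idles
}

// Branch-free alternative: every thread runs the same instruction stream;
// the ternary compiles to a predicated select, not a branch.
__global__ void uniform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float even = (i % 2 == 0) ? 1.0f : 0.0f;
        data[i] = data[i] * (1.0f + even) + (1.0f - even);
    }
}
```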
You can also use instruction-level parallelism (you should read these slides).
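The idea from those slides can be sketched as each thread processing several independent elements, so that independent loads and arithmetic can overlap in the pipeline (the factor of 4 below is an assumption to be tuned per device, and bounds checking is simplified):

```cuda
// Each thread handles 4 independent elements: the four loads have no
// dependencies between them, so the scheduler can issue them back to back
// and hide memory latency behind them.
__global__ void ilp4(const float *in, float *out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {
        float a = in[base + 0];
        float b = in[base + 1];
        float c = in[base + 2];
        float d = in[base + 3];
        out[base + 0] = a * 2.0f;
        out[base + 1] = b * 2.0f;
        out[base + 2] = c * 2.0f;
        out[base + 3] = d * 2.0f;
    }
}
```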
It is, however, hard to know when you cannot optimize your kernels any further. Saying that the execution time of your kernels is small does not mean much: small compared to what? Are you trying to do some real-time computation? Is scalability an issue? These are some of the questions you need to answer before trying to optimize your kernels.
On a side note, you should also use error checking extensively, and rely on `cuda-memcheck`/`cuda-gdb` to debug your code.
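A common pattern for such error checking is a wrapper macro around every CUDA runtime call (the macro name `CUDA_CHECK` is a widespread convention, not an official API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Aborts with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Kernel launches do not return an error code directly: check the launch
// with cudaGetLastError() and execution errors after synchronizing, e.g.
//
//   myKernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaGetLastError());
//   CUDA_CHECK(cudaDeviceSynchronize());
```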