Kernel launches are asynchronous from the point of view of the CPU (see this answer). If you time your kernel the way you do, without any synchronization (i.e. without calling `cudaDeviceSynchronize()`), your timings will not mean anything, since the computation may still be in progress on the GPU.
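As a minimal sketch of correct timing, you can use CUDA events, which are recorded in the GPU's stream and therefore measure the kernel itself (the kernel `myKernel` and its launch configuration below are placeholders, not from your code):

```cuda
#include <cstdio>
#include <cuda_runtime.h>

__global__ void myKernel(float *data, int n) {  // placeholder kernel
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] *= 2.0f;
}

int main() {
    const int n = 1 << 20;
    float *d_data;
    cudaMalloc(&d_data, n * sizeof(float));

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start);
    myKernel<<<(n + 255) / 256, 256>>>(d_data, n);
    cudaEventRecord(stop);

    // Block the host until the stop event is reached on the GPU; without
    // this (or cudaDeviceSynchronize()), the measured time is meaningless.
    cudaEventSynchronize(stop);

    float ms = 0.0f;
    cudaEventElapsedTime(&ms, start, stop);
    printf("kernel time: %f ms\n", ms);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_data);
    return 0;
}
```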
You can trust NVIDIA's profilers when it comes to timing your kernels (`nvprof`/`nvvp`). The NVIDIA Visual Profiler can also analyze your program and offer advice on what may be wrong with your kernels: uncoalesced memory accesses, an inefficient number of threads/blocks assigned, etc. You also need to compile your code in release mode with optimization flags (e.g. `-O3`) to get relevant timings.
Concerning kernel optimization, you need to find your bottlenecks (e.g. your 44% kernel), analyze them, and apply the usual optimization techniques:
- Use the effective bandwidth of your device to work out what the upper bound on performance ought to be for your kernel
- Minimize memory transfers between host and device - even if that means doing calculations on the device which are not efficient there
- Coalesce all memory accesses
- Prefer shared memory access to global memory access
- Avoid code execution branching within a single warp as this serializes the threads
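To illustrate the last two points, here is a sketch contrasting a coalesced access pattern with a divergent one (the kernels and the arithmetic are illustrative only, not taken from your code):

```cuda
// Coalesced: consecutive threads in a warp access consecutive addresses,
// so the hardware can combine them into a few wide memory transactions.
__global__ void coalesced(const float *in, float *out, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = in[i];
}

// Divergent: threads of the same warp take different branches, which the
// hardware serializes (each path executes while the other threads idle).
__global__ void divergent(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i >= n) return;
    if (i % 2 == 0) data[i] *= 2.0f;   // half the warp idles here...
    else            data[i] += 1.0f;   // ...then the other half idles
}

// Branch-free alternative: every thread runs the same instruction stream;
// the ternary compiles to a predicated select, not a branch.
__global__ void uniform(float *data, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float even = (i % 2 == 0) ? 1.0f : 0.0f;
        data[i] = data[i] * (1.0f + even) + (1.0f - even);
    }
}
```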
You can also use instruction-level parallelism (you should read these slides).
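The idea from those slides can be sketched as each thread processing several independent elements, so that independent loads and arithmetic can overlap in the pipeline (the factor of 4 below is an assumption to be tuned per device, and bounds checking is simplified):

```cuda
// Each thread handles 4 independent elements: the four loads have no
// dependencies between them, so the scheduler can issue them back to back
// and hide memory latency behind them.
__global__ void ilp4(const float *in, float *out, int n) {
    int base = (blockIdx.x * blockDim.x + threadIdx.x) * 4;
    if (base + 3 < n) {
        float a = in[base + 0];
        float b = in[base + 1];
        float c = in[base + 2];
        float d = in[base + 3];
        out[base + 0] = a * 2.0f;
        out[base + 1] = b * 2.0f;
        out[base + 2] = c * 2.0f;
        out[base + 3] = d * 2.0f;
    }
}
```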
It is, however, hard to know when you cannot optimize your kernels any further. Saying that the execution time of your kernels is small does not mean much: small compared to what? Are you trying to do some real-time computation? Is scalability an issue? These are some of the questions you need to answer before trying to optimize your kernels.
On a side note, you should also use error checking extensively, and rely on `cuda-memcheck`/`cuda-gdb` to debug your code.
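A common pattern for such error checking is a wrapper macro around every CUDA runtime call (the macro name `CUDA_CHECK` is a widespread convention, not an official API):

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Aborts with file/line information if a CUDA runtime call fails.
#define CUDA_CHECK(call)                                                  \
    do {                                                                  \
        cudaError_t err = (call);                                         \
        if (err != cudaSuccess) {                                         \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",                  \
                    cudaGetErrorString(err), __FILE__, __LINE__);         \
            exit(EXIT_FAILURE);                                           \
        }                                                                 \
    } while (0)

// Kernel launches do not return an error code directly: check the launch
// with cudaGetLastError() and execution errors after synchronizing, e.g.
//
//   myKernel<<<blocks, threads>>>(...);
//   CUDA_CHECK(cudaGetLastError());
//   CUDA_CHECK(cudaDeviceSynchronize());
```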