
I have a CUDA kernel that calls out to a series of device functions.

What is the best way to get the execution time for each of the device functions?

What is the best way to get the execution time for a section of code in one of the device functions?

Roger Dahl

1 Answer


In my own code, I use the clock() function to get precise timings. For convenience, I have the following macros:

enum {
    tid_this = 0,
    tid_that,
    tid_count
    };
__device__ float cuda_timers[ tid_count ];
#ifdef USETIMERS
 #define TIMER_TIC clock_t tic; if ( threadIdx.x == 0 ) tic = clock();
 #define TIMER_TOC(tid) clock_t toc = clock(); if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers[tid] , ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) );
#else
 #define TIMER_TIC
 #define TIMER_TOC(tid)
#endif

These can then be used to instrument the device code as follows:

__global__ void mykernel ( ... ) {

    /* Start the timer. */
    TIMER_TIC

    /* Do stuff. */
    ...

    /* Stop the timer and store the results to the "tid_this" counter. */
    TIMER_TOC( tid_this );

    }

You can then read the cuda_timers in the host code.
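
For reference, a minimal host-side sketch of that step (the grid/block configuration and kernel arguments are placeholders, and error checking is omitted):

float timers_host[ tid_count ] = { 0.0f };

/* Reset the device-side accumulators before the launch. */
cudaMemcpyToSymbol( cuda_timers , timers_host , sizeof(cuda_timers) );

mykernel<<< grid , block >>>( ... );
cudaDeviceSynchronize();

/* Copy the accumulated tick counts back to the host. */
cudaMemcpyFromSymbol( timers_host , cuda_timers , sizeof(cuda_timers) );
printf( "tid_this: %.0f ticks (summed over all blocks)\n" , timers_host[tid_this] );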

A few notes:

  • The timers work on a per-block basis, i.e. if you have 100 blocks executing the same kernel, the sum of all their times will be stored.
  • Having said that, the timer assumes that the zeroth thread is active, so make sure you do not call these macros in a possibly divergent part of the code.
  • The timers count the number of clock ticks. To convert to milliseconds, divide the tick count by the device clock rate in kHz (cudaDeviceProp::clockRate is reported in kHz); see the sketch after this list.
  • The timers can slow down your code a bit, which is why I wrapped them in the #ifdef USETIMERS so you can switch them off easily.
  • Although clock() returns integer values of type clock_t, I store the accumulated values as float, otherwise the values will wrap around for kernels that take longer than a few seconds (accumulated over all blocks).
  • The expression ( toc > tic ) ? (toc - tic) : ( toc + (0xffffffff - tic) ) is necessary in case the clock counter wraps around.
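
Regarding the conversion to milliseconds, a sketch that reuses the timers_host array from the host-side sketch above:

cudaDeviceProp prop;
cudaGetDeviceProperties( &prop , 0 );

/* clockRate is in kHz, so ticks / kHz = milliseconds. */
float ms = timers_host[tid_this] / (float)prop.clockRate;
printf( "tid_this: %.3f ms (summed over all blocks)\n" , ms );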

P.S. This is a copy of my reply to this question, which didn't get many points there since the timing required was for the whole kernel.

Pedro
  • Thank you. Very useful. Looking up `clock()`, I found that there is also a `clock64()`, which might remove the need for overflow checking and conversion to float. – Roger Dahl Jun 26 '12 at 14:30
  • @RogerDahl: Thanks for pointing that out! It seems to have been added with CUDA 4.2. – Pedro Jun 26 '12 at 14:35
  • Fermi added a 64-bit clock result. `clock64()` was added well before CUDA 4.2. Note that when doing this type of timing, you have to be careful about divergence -- if different warps take different paths within your timing, timing only thread 0 will not be accurate. – harrism Jun 26 '12 at 22:14
  • Further to that, also be sure to disassemble the compiler output and make sure that reordering hasn't occurred. The compiler and assembler (at least the older open64 toolchain) can and do move code around; this can mean that clock calls can wind up next to one another instead of bracketing the code you intended to time. – talonmies Jun 27 '12 at 05:14
  • @harrism: I was a bit imprecise about that. The function `clock64()` only appeared in the CUDA Programming Guide as of version 4.2. As for your first point, I have updated my answer accordingly. Thanks! – Pedro Jun 27 '12 at 08:43
  • @talonmies: Good point. If you're worried about re-ordering, you could implement the macros using inline `volatile` assembly, e.g. as described at the bottom of page 7 of the [Using PTX Assembly in CUDA](http://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/Using_Inline_PTX_Assembly_In_CUDA.pdf) guide. – Pedro Jun 27 '12 at 08:52
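
Following up on the clock64() suggestion in the comments: a hedged sketch of a 64-bit variant of the macros, which drops the wrap-around check. The names TIMER_TIC64, TIMER_TOC64 and cuda_timers64 are hypothetical (not part of the original answer), and the 64-bit atomicAdd overload requires an unsigned long long accumulator.

/* 64-bit variant: clock64() will not wrap for practical kernel run times. */
__device__ unsigned long long cuda_timers64[ tid_count ];
#ifdef USETIMERS
 #define TIMER_TIC64 long long tic64; if ( threadIdx.x == 0 ) tic64 = clock64();
 #define TIMER_TOC64(tid) if ( threadIdx.x == 0 ) atomicAdd( &cuda_timers64[tid] , (unsigned long long)( clock64() - tic64 ) );
#else
 #define TIMER_TIC64
 #define TIMER_TOC64(tid)
#endif

The accumulated counts can then be read back with cudaMemcpyFromSymbol, as in the host-side sketch above.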