6

I'd like to measure the time a bit of code within my kernel takes. I've followed this question along with its comments so that my kernel looks something like this:

__global__ void kernel(..., long long int *runtime)
{
    long long int start = 0; 
    long long int stop = 0;

    asm volatile("mov.u64 %0, %%clock64;" : "=l"(start));

    /* Some code here */

    asm volatile("mov.u64 %0, %%clock64;" : "=l"(stop));

    runtime[threadIdx.x] = stop - start;
    ...
}

The answer says to do a conversion as follows:

The timers count the number of clock ticks. To get the number of milliseconds, divide this by the number of GHz on your device and multiply by 1000.

For which I do:

for(long i = 0; i < size; i++)
{
  fprintf(stdout, "%ld:%lld=%f(ms)\n", i, runtime[i], (runtime[i]/1.62)*1000.0);
}

where 1.62 is the GPU Max Clock rate (in GHz) of my device. But the times I get in milliseconds do not look right: they suggest that each thread took minutes to complete, which cannot be correct, since execution finishes in less than a second of wall-clock time. Is the conversion formula incorrect, or am I making a mistake somewhere? Thanks.

John
  • Divide by the number of hertz, not GHz. Divide by `1620000000.0f`. Clock cycles divided by clock cycles per second gives you the number of seconds. Multiply the number of seconds by 1000 to get the number of milliseconds. – Robert Crovella Mar 24 '17 at 20:33
  • @RobertCrovella, it now works as expected, thanks! If you post this as an answer I'd happily mark it as accepted. – John Mar 24 '17 at 21:52

2 Answers

8

The correct conversion in your case is not GHz:

fprintf(stdout, "%ld:%lld=%f(ms)\n", i, runtime[i], (runtime[i]/1.62)*1000.0);
                                                                ^^^^

but hertz:

fprintf(stdout, "%ld:%lld=%f(ms)\n", i, runtime[i], (runtime[i]/1620000000.0f)*1000.0);
                                                                ^^^^^^^^^^^^^

In the dimensional analysis:

                  clock cycles
clock cycles  /  -------------- = seconds
                   second
                    

the first term is the measured clock-cycle count, the second term is the clock frequency of the GPU (in hertz, not GHz), and the result is the desired measurement in seconds. You can convert to milliseconds by multiplying the seconds by 1000.
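Concretely, with the 1.62 GHz clock from the question, a measurement of 1,620,000 clock cycles works out as:

1,620,000 cycles / 1,620,000,000 cycles per second = 0.001 seconds = 1 ms

whereas dividing by 1.62 and multiplying by 1000 would report 1,000,000,000 "ms", a factor of one billion too large.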

Here's a worked example that shows a device-independent way to do it (so you don't have to hard-code the clock frequency):

$ cat t1306.cu
#include <stdio.h>

const long long delay_time = 1000000000;
const int nthr = 1;
const int nTPB = 256;

__global__ void kernel(long long *clocks){

  int idx=threadIdx.x+blockDim.x*blockIdx.x;
  long long start=clock64();
  while (clock64() < start+delay_time); // spin until at least delay_time clock cycles have elapsed
  if (idx < nthr) clocks[idx] = clock64()-start;
}

int main(){

  int peak_clk = 1;
  int device = 0;
  long long *clock_data;
  long long *host_data;
  host_data = (long long *)malloc(nthr*sizeof(long long));
  cudaError_t err = cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, device);
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  err = cudaMalloc(&clock_data, nthr*sizeof(long long));
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  kernel<<<(nthr+nTPB-1)/nTPB, nTPB>>>(clock_data);
  err = cudaMemcpy(host_data, clock_data, nthr*sizeof(long long), cudaMemcpyDeviceToHost);
  if (err != cudaSuccess) {printf("cuda err: %d at line %d\n", (int)err, __LINE__); return 1;}
  printf("delay clock cycles: %ld, measured clock cycles: %ld, peak clock rate: %dkHz, elapsed time: %fms\n", delay_time, host_data[0], peak_clk, host_data[0]/(float)peak_clk);
  return 0;
}
$ nvcc -arch=sm_35 -o t1306 t1306.cu
$ ./t1306
delay clock cycles: 1000000000, measured clock cycles: 1000000210, peak clock rate: 732000kHz, elapsed time: 1366.120483ms
$

This uses cudaDeviceGetAttribute to get the clock rate. Since the attribute is reported in kHz, which is numerically the number of clock cycles per millisecond, dividing the measured cycle count by it gives the elapsed time in milliseconds directly.
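Applied to the runtime array from the question, the same device-independent conversion would look something like this (just a sketch: size and runtime are the variables from the question, and error checking is omitted):

int peak_clk = 1;  // clock rate in kHz, i.e. clock cycles per millisecond
cudaDeviceGetAttribute(&peak_clk, cudaDevAttrClockRate, 0);

for(long i = 0; i < size; i++)
{
  // cycles divided by cycles-per-millisecond gives milliseconds
  fprintf(stdout, "%ld:%lld=%f(ms)\n", i, runtime[i], runtime[i]/(float)peak_clk);
}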

In my experience, the above method generally works well on datacenter GPUs, which run at their reported clock rate (this may be affected by settings you make in nvidia-smi). Other GPUs, such as GeForce GPUs, may run at (unpredictable) boost clocks that make this method inaccurate.

Also, more recently, CUDA has the ability to preempt activity on the GPU. This can come about in a variety of circumstances, such as debugging, CUDA dynamic parallelism, and other situations. If preemption occurs for whatever reason, attempting to measure anything based on clock64() is generally not reliable.

Robert Crovella
  • I don't quite understand the importance of the delay time and `while (clock64() < start+delay_time);`. Can you explain? Thanks. – John Apr 27 '17 at 17:09
  • I'm assuming it purely simulates a delay that would be caused by some real work; is this assumption correct? – John Apr 27 '17 at 17:16
  • Yes, it forces the kernel duration to last at least as long as the `delay_time` – Robert Crovella Apr 27 '17 at 17:52
6

clock64() returns a value in graphics clock cycles. The graphics clock is dynamic, so I would not recommend using a constant to try to convert to seconds. If you want to convert to wall time, the better option is to use %globaltimer, a 64-bit clock register you can read with:

asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));

The unit is nanoseconds.

The default resolution is 32 ns, with an update every microsecond. The NVIDIA performance tools force the update to every 32 ns (i.e. at 31.25 MHz). This clock is used by CUPTI for start timestamps when capturing a concurrent kernel trace.
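For reference, here is a minimal sketch of the kernel from the question rewritten to use %globaltimer (the runtime_ns parameter name is only illustrative):

__global__ void kernel(long long int *runtime_ns)
{
    long long int start = 0;
    long long int stop = 0;

    // %globaltimer already counts nanoseconds, so no clock-rate conversion is needed
    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(start));

    /* Some code here */

    asm volatile("mov.u64 %0, %%globaltimer;" : "=l"(stop));

    // Elapsed wall time in nanoseconds; divide by 1e6 on the host for milliseconds.
    runtime_ns[threadIdx.x] = stop - start;
}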

talonmies
Greg Smith