CUDA clock() leads to zero clock cycles

Question

I want to use clock() to compare different kernel implementations. I tried it in a simple SAXPY example, but it reports zero clock cycles, which cannot be right.

I have already found some examples of how to use clock() (here and here), but I cannot get the approach to work in my own code.

Here is the code that I am using:

/* SAXPY code example from  https://devblogs.nvidia.com/parallelforall/easy-introduction-cuda-c-and-c/ */

#include <stdio.h>
#include <stdlib.h>

// The declaration specifier __global__ defines a kernel. This code
// will be copied to the device and will be executed there in parallel
__global__
void saxpy(int n, float a, float *x, float *y, int *kernel_clock)
{
  // Compute the global index of this thread across all blocks
  int i = blockIdx.x*blockDim.x + threadIdx.x;

  clock_t start = clock();

  // Each thread processes exactly one element of the arrays
  if (i < n) y[i] = a*x[i] + y[i];

  clock_t stop = clock();

  kernel_clock[i] = (int) (stop-start);
}

int main(void)
{
  // Clock cycles of threads
  int *kernel_clock;
  int *d_kernel_clock;
  // Number of elements: 2^20 (about one million)
  int N = 1<<20;
  float *x, *y, *d_x, *d_y;
  // Allocate an array on the *host* of the size of N
  x = (float*)malloc(N*sizeof(float));
  y = (float*)malloc(N*sizeof(float));
  kernel_clock = (int*)malloc(N*sizeof(int));

  // Allocate an array on the *device* of the size of N
  cudaMalloc(&d_x, N*sizeof(float));
  cudaMalloc(&d_y, N*sizeof(float));
  cudaMalloc(&d_kernel_clock, N*sizeof(int));

  // Filling the array of the host
  for (int i = 0; i < N; i++) {
    x[i] = 1.0f;
    y[i] = 2.0f;
  }

  // Copy the host array to the device array
  cudaMemcpy(d_x, x, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_y, y, N*sizeof(float), cudaMemcpyHostToDevice);
  cudaMemcpy(d_kernel_clock, kernel_clock, N*sizeof(int), cudaMemcpyHostToDevice);

  // Perform SAXPY on 1M elements. The triple chevrons specify how
  // the threads are grouped on the device (number of blocks, threads per block)
  saxpy<<<(N+255)/256, 256>>>(N, 2.0f, d_x, d_y, d_kernel_clock);
  cudaDeviceSynchronize();

  // Copy the result from the device to the host
  cudaMemcpy(y, d_y, N*sizeof(float), cudaMemcpyDeviceToHost);
  cudaMemcpy(kernel_clock, d_kernel_clock, N*sizeof(int), cudaMemcpyDeviceToHost);

  // Calculate average clock time
  float average_clock = 0;
  for (int i = 0; i < N; i++) {
      average_clock += (float) (kernel_clock[i]);
  }
  average_clock /= N;

  // Display the time to the screen
  printf ("Kernel clock cycles:   %.4f\n", average_clock);

  // Free the memory on the host and device
  free(x);
  free(y);
  free(kernel_clock);
  cudaFree(d_x);
  cudaFree(d_y);
  cudaFree(d_kernel_clock);
}

This code example leads to:

Kernel clock cycles:   0.0000

I am not sure what I am doing wrong. So my question is: How do I actually get a reasonable result?

I don't see any error checking. What happens if you run your code with `cuda-memcheck` ? — Robert Crovella, Oct 03 '16 at 11:35
`cuda-memcheck` reports no errors: `======== ERROR SUMMARY: 0 errors` — stebran, Oct 04 '16 at 16:01

score 1 · Accepted Answer · answered Oct 03 '16 at 14:32

Quoting from one of the answers you linked to in your question:

You should also be aware that the compiler and assembler do perform instruction re-ordering so you might want to check that the clock calls don't wind up getting put next to each other in the SASS output (use cuobjdump to check).

I believe this is the source of your problem. If I compile your kernel with the CUDA 8 release toolkit and then disassemble the resulting machine code with cuobjdump, I get the following:

    code for sm_52
            Function : _Z5saxpyifPfS_Pi
    .headerflags    @"EF_CUDA_SM52 EF_CUDA_PTX_SM(EF_CUDA_SM52)"
                                                                                           /* 0x001c4400fe0007f6 */
    /*0008*/                   MOV R1, c[0x0][0x20];                                       /* 0x4c98078000870001 */
    /*0010*/         {         CS2R R7, SR_CLOCKLO;                                        /* 0x50c8000005070007 */
    /*0018*/                   S2R R0, SR_CTAID.X;        }                                /* 0xf0c8000002570000 */
                                                                                           /* 0x083fc400e3e007f0 */
    /*0028*/         {         CS2R R8, SR_CLOCKLO;                                        /* 0x50c8000005070008 */
    /*0030*/                   S2R R2, SR_TID.X;        }                                  /* 0xf0c8000002170002 */
    /*0038*/                   XMAD.MRG R3, R0.reuse, c[0x0] [0x8].H1, RZ;                 /* 0x4f107f8000270003 */
                                                                                           /* 0x081fc400fec207f6 */
    /*0048*/                   XMAD R2, R0.reuse, c[0x0] [0x8], R2;                        /* 0x4e00010000270002 */
    /*0050*/                   XMAD.PSL.CBCC R0, R0.H1, R3.H1, R2;                         /* 0x5b30011800370000 */
    /*0058*/                   ISETP.GE.AND P0, PT, R0.reuse, c[0x0][0x140], PT;           /* 0x4b6d038005070007 */
                                                                                           /* 0x001fd400fc2007ec */
    /*0068*/                   SHR R9, R0, 0x1f;                                           /* 0x3829000001f70009 */
    /*0070*/              @!P0 SHF.L.U64 R2, RZ, 0x2, R0;                                  /* 0x36f800400028ff02 */
    /*0078*/              @!P0 SHF.L.U64 R3, R0, 0x2, R9;                                  /* 0x36f804c000280003 */
                                                                                           /* 0x001fc040fe4207f6 */
    /*0088*/              @!P0 IADD R4.CC, R2.reuse, c[0x0][0x148];                        /* 0x4c10800005280204 */
    /*0090*/              @!P0 IADD.X R5, R3.reuse, c[0x0][0x14c];                         /* 0x4c10080005380305 */
    /*0098*/         {    @!P0 IADD R2.CC, R2, c[0x0][0x150];                              /* 0x4c10800005480202 */
    /*00a8*/              @!P0 LDG.E R4, [R4];        }                                    /* 0x0005c400fe400076 */
                                                                                           /* 0xeed4200000080404 */
    /*00b0*/              @!P0 IADD.X R3, R3, c[0x0][0x154];                               /* 0x4c10080005580303 */
    /*00b8*/              @!P0 LDG.E R6, [R2];                                             /* 0xeed4200000080206 */
                                                                                           /* 0x001fd800fea007e1 */
    /*00c8*/                   LEA R10.CC, R0, c[0x0][0x158], 0x2;                         /* 0x4bd781000567000a */
    /*00d0*/                   IADD R8, -R7, R8;                                           /* 0x5c12000000870708 */
    /*00d8*/                   LEA.HI.X R9, R0, c[0x0][0x15c], R9, 0x2;                    /* 0x1a17048005770009 */
                                                                                           /* 0x001fc008fe4007f1 */
    /*00e8*/                   MOV R7, R9;                                                 /* 0x5c98078000970007 */
    /*00f0*/              @!P0 FFMA R0, R4, c[0x0][0x144], R6;                             /* 0x4980030005180400 */
    /*00f8*/         {         MOV R6, R10;                                                /* 0x5c98078000a70006 */
    /*0108*/              @!P0 STG.E [R2], R0;        }                                    /* 0x001ffc005e2001f2 */
                                                                                           /* 0xeedc200000080200 */
    /*0110*/                   STG.E [R6], R8;                                             /* 0xeedc200000070608 */
    /*0118*/                   EXIT;                                                       /* 0xe30000000007000f */
                                                                                           /* 0x001f8000fc0007ff */
    /*0128*/                   BRA 0x120;                                                  /* 0xe2400fffff07000f */
    /*0130*/                   NOP;                                                        /* 0x50b0000000070f00 */
    /*0138*/                   NOP;                                                        /* 0x50b0000000070f00 */
            .................................

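You can see that the two clock reads have been reordered so that none of the code being timed sits between them. Both `CS2R ..., SR_CLOCKLO` instructions (offsets 0x0010 and 0x0028) are issued at the top of the kernel, separated only by an `S2R` read of the block index, while the loads (`LDG.E`), the multiply-add (`FFMA`), and the store of `y[i]` (`STG.E` at 0x0108) all come afterwards. The difference `stop - start` is computed by `IADD R8, -R7, R8` at 0x00d0. That will result in a zero, or very close to zero, clock measurement for many, if not all, warps running this code.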
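
One way to make this reordering less likely is to read the counter through inline PTX with a `"memory"` clobber, which asks the compiler not to move memory accesses across the read. This is a minimal sketch under assumptions, not something verified against your exact toolchain. The names `CHECK`, `read_clock`, `saxpy_timed`, and `average` are illustrative and not from the original post. ptxas may still reschedule instructions, so the SASS should be checked again afterwards.

```cuda
#include <cstdio>
#include <cstdlib>

// Illustrative helper (not from the original post): abort on a CUDA error.
#define CHECK(call)                                              \
  do {                                                           \
    cudaError_t err_ = (call);                                   \
    if (err_ != cudaSuccess) {                                   \
      fprintf(stderr, "%s:%d: %s\n", __FILE__, __LINE__,         \
              cudaGetErrorString(err_));                         \
      exit(EXIT_FAILURE);                                        \
    }                                                            \
  } while (0)

// Illustrative helper: read the 32-bit cycle counter via inline PTX.
// Assumption: the "memory" clobber keeps the loads and stores of the timed
// code between the two reads. This is a request to the compiler, not a
// guarantee about the final SASS.
__device__ __forceinline__ unsigned int read_clock()
{
  unsigned int t;
  asm volatile("mov.u32 %0, %%clock;" : "=r"(t) :: "memory");
  return t;
}

__global__ void saxpy_timed(int n, float a, const float *x, float *y,
                            unsigned int *cycles)
{
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= n) return;  // out-of-range threads write nothing

  unsigned int start = read_clock();
  y[i] = a * x[i] + y[i];
  unsigned int stop = read_clock();

  // Unsigned subtraction stays correct across a single counter wrap.
  cycles[i] = stop - start;
}

// Host-side mean of the per-thread cycle counts, accumulated in double.
double average(const unsigned int *v, int n)
{
  double sum = 0.0;
  for (int i = 0; i < n; i++) sum += v[i];
  return sum / n;
}

int main()
{
  const int N = 1 << 20;
  float *x = (float *)malloc(N * sizeof(float));
  float *y = (float *)malloc(N * sizeof(float));
  unsigned int *cycles = (unsigned int *)malloc(N * sizeof(unsigned int));
  for (int i = 0; i < N; i++) { x[i] = 1.0f; y[i] = 2.0f; }

  float *d_x, *d_y;
  unsigned int *d_cycles;
  CHECK(cudaMalloc(&d_x, N * sizeof(float)));
  CHECK(cudaMalloc(&d_y, N * sizeof(float)));
  CHECK(cudaMalloc(&d_cycles, N * sizeof(unsigned int)));
  CHECK(cudaMemcpy(d_x, x, N * sizeof(float), cudaMemcpyHostToDevice));
  CHECK(cudaMemcpy(d_y, y, N * sizeof(float), cudaMemcpyHostToDevice));

  saxpy_timed<<<(N + 255) / 256, 256>>>(N, 2.0f, d_x, d_y, d_cycles);
  CHECK(cudaGetLastError());        // launch errors
  CHECK(cudaDeviceSynchronize());   // execution errors

  CHECK(cudaMemcpy(cycles, d_cycles, N * sizeof(unsigned int),
                   cudaMemcpyDeviceToHost));
  printf("Average cycles per thread: %.4f\n", average(cycles, N));

  free(x); free(y); free(cycles);
  CHECK(cudaFree(d_x)); CHECK(cudaFree(d_y)); CHECK(cudaFree(d_cycles));
  return 0;
}
```

After compiling, disassemble again with `cuobjdump -sass` and confirm that the `LDG.E`, `FFMA`, and `STG.E` instructions now sit between the two `CS2R` clock reads. If they do, the per-thread difference includes the global memory latency of the loads, so expect it to vary considerably from warp to warp.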

Thanks! I understand the problem now, but in which lines of your output do I see this? — stebran, Oct 04 '16 at 16:10
