Measuring the total time taken by the kernel when using streams

Question

I want to measure the total time spent in my kernels, which are launched multiple times across two streams. Does the code below give me the total time spent in the streamed kernels, or does the returned time need to be multiplied by the number of launches?

cudaEvent_t start, stop;    
cudaEventCreate(&start);
cudaEventCreate(&stop);


for(x=0; x<SIZE; x+=N*2){

     gpuErrchk(cudaMemcpyAsync(data_d0, data_h+x, N*sizeof(char), cudaMemcpyHostToDevice, stream0));
     gpuErrchk(cudaMemcpyAsync(data_d1, data_h+x+N, N*sizeof(char), cudaMemcpyHostToDevice, stream1));


     gpuErrchk(cudaMemcpyAsync(array_d0, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream0));
     gpuErrchk(cudaMemcpyAsync(array_d1, array_h, wrap->size*sizeof(node_r), cudaMemcpyHostToDevice, stream1));

     cudaEventRecord(start, 0);
        GPU<<<N/512,512,0,stream0>>>(array_d0, data_d0, out_d0 );
        GPU<<<N/512,512,0,stream1>>>(array_d1, data_d1, out_d1);
     cudaEventRecord(stop, 0);

     gpuErrchk(cudaMemcpyAsync(out_h+x, out_d0 , N * sizeof(int), cudaMemcpyDeviceToHost, stream0));
     gpuErrchk(cudaMemcpyAsync(out_h+x+N, out_d1 ,N *  sizeof(int), cudaMemcpyDeviceToHost, stream1));

} 

float elapsedTime;
cudaEventElapsedTime(&elapsedTime, start, stop);
cudaEventDestroy(start);
cudaEventDestroy(stop);
printf("Time %f ms\n", elapsedTime);

score 1 · Accepted Answer · answered Feb 26 '14 at 18:09

It will not capture the total execution time for the kernels for all passes of the loop.

From the documentation:

If cudaEventRecord() has previously been called on event, then this call will overwrite any existing state in event. Any subsequent calls which examine the status of event will only examine the completion of this most recent call to cudaEventRecord().

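Because every pass re-records both events, the elapsed time you read after the loop covers only the final pass. If you believe that the execution time for each pass through the loop will be approximately the same, then you can just multiply the result by the number of passes.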

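Note that you should issue a cudaEventSynchronize() call on the stop event before the call to cudaEventElapsedTime(). Otherwise the stop event may not have completed yet when you query it, and cudaEventElapsedTime() will return cudaErrorNotReady instead of a valid time.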
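If the passes cannot be assumed to take the same time, one alternative is to read the elapsed time inside the loop and accumulate it. The following is a minimal sketch of that idea, not the asker's actual program: dummyKernel, sumPerPassKernelTime, CUDA_CHECK and the sizes are hypothetical stand-ins, and it assumes the default compile settings, where stream 0 is the legacy default stream. Synchronizing on every pass stalls the host each iteration, so it may reduce the overlap between passes.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check macro, in the spirit of the question's gpuErrchk.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical stand-in for the question's GPU kernel.
__global__ void dummyKernel(int *out, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) out[i] = i;
}

// Sums the per-pass kernel time instead of keeping only the last pass.
float sumPerPassKernelTime(int passes, int n)
{
    int *out_d0, *out_d1;
    CUDA_CHECK(cudaMalloc(&out_d0, n * sizeof(int)));
    CUDA_CHECK(cudaMalloc(&out_d1, n * sizeof(int)));

    cudaStream_t stream0, stream1;
    CUDA_CHECK(cudaStreamCreate(&stream0));
    CUDA_CHECK(cudaStreamCreate(&stream1));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    const int threads = 512;
    const int blocks = (n + threads - 1) / threads;
    float totalMs = 0.0f;

    for (int pass = 0; pass < passes; ++pass) {
        // Re-recording overwrites the previous state of both events,
        // so the time must be read before the next pass starts.
        CUDA_CHECK(cudaEventRecord(start, 0));
        dummyKernel<<<blocks, threads, 0, stream0>>>(out_d0, n);
        dummyKernel<<<blocks, threads, 0, stream1>>>(out_d1, n);
        CUDA_CHECK(cudaEventRecord(stop, 0));

        // Wait until 'stop' has completed; only then is the time valid.
        CUDA_CHECK(cudaEventSynchronize(stop));
        float passMs = 0.0f;
        CUDA_CHECK(cudaEventElapsedTime(&passMs, start, stop));
        totalMs += passMs;
    }

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaStreamDestroy(stream0));
    CUDA_CHECK(cudaStreamDestroy(stream1));
    CUDA_CHECK(cudaFree(out_d0));
    CUDA_CHECK(cudaFree(out_d1));
    return totalMs;
}

int main()
{
    float totalMs = sumPerPassKernelTime(8, 1 << 20);
    printf("Total kernel time over all passes: %f ms\n", totalMs);
    return 0;
}
```

Since the two kernels in a pass may overlap, each per-pass figure is the wall time for the pair, not the sum of the two individual kernel durations.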

score 0 · Answer 2 · edited May 23 '17 at 11:50

Event-based timing was added to CUDA to enable fine-grained timing of on-chip execution (for example, you should get an accurate time even if only one kernel invocation is bracketed by the event start/stop calls). But streams and out-of-order execution introduce ambiguity into the meaning of the "timestamp" recorded by cudaEventRecord(). cudaEventRecord() takes a stream parameter, and as far as I know it respects that stream parameter; but the stream's execution can be affected by other streams, e.g. if they are contending for some resource.

So it is best practice to call cudaEventRecord() on the NULL stream to serialize.
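As a rough illustration of recording on the NULL stream, the sketch below brackets a whole loop of two-stream launches with a single event pair. It is written under stated assumptions rather than taken from the question: busyKernel, timeLoopOnLegacyStream, CUDA_CHECK and the sizes are hypothetical, and the streams are created with cudaStreamCreate(), i.e. as blocking streams. Streams created with the cudaStreamNonBlocking flag do not synchronize with the NULL stream, so this bracketing would not order the events against their work. cudaStreamLegacy is used to name the legacy NULL stream explicitly, in case the file is compiled with per-thread default streams.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical error-check macro.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error %s at %s:%d\n",             \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Hypothetical kernel that just does some arithmetic per element.
__global__ void busyKernel(float *data, int n)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) data[i] = data[i] * 2.0f + 1.0f;
}

// Times all passes with one event pair recorded on the legacy NULL stream.
float timeLoopOnLegacyStream(int passes, int n)
{
    float *d0, *d1;
    CUDA_CHECK(cudaMalloc(&d0, n * sizeof(float)));
    CUDA_CHECK(cudaMalloc(&d1, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d0, 0, n * sizeof(float)));
    CUDA_CHECK(cudaMemset(d1, 0, n * sizeof(float)));

    // Blocking streams: they implicitly synchronize with the legacy stream.
    cudaStream_t s0, s1;
    CUDA_CHECK(cudaStreamCreate(&s0));
    CUDA_CHECK(cudaStreamCreate(&s1));

    cudaEvent_t start, stop;
    CUDA_CHECK(cudaEventCreate(&start));
    CUDA_CHECK(cudaEventCreate(&stop));

    const int threads = 512;
    const int blocks = (n + threads - 1) / threads;

    // 'start' completes only after earlier work in blocking streams is done.
    CUDA_CHECK(cudaEventRecord(start, cudaStreamLegacy));
    for (int pass = 0; pass < passes; ++pass) {
        busyKernel<<<blocks, threads, 0, s0>>>(d0, n);
        busyKernel<<<blocks, threads, 0, s1>>>(d1, n);
    }
    // 'stop' completes only after all launches above have finished.
    CUDA_CHECK(cudaEventRecord(stop, cudaStreamLegacy));

    CUDA_CHECK(cudaEventSynchronize(stop));
    float ms = 0.0f;
    CUDA_CHECK(cudaEventElapsedTime(&ms, start, stop));

    CUDA_CHECK(cudaEventDestroy(start));
    CUDA_CHECK(cudaEventDestroy(stop));
    CUDA_CHECK(cudaStreamDestroy(s0));
    CUDA_CHECK(cudaStreamDestroy(s1));
    CUDA_CHECK(cudaFree(d0));
    CUDA_CHECK(cudaFree(d1));
    return ms;
}

int main()
{
    float ms = timeLoopOnLegacyStream(8, 1 << 20);
    printf("Wall time for all passes: %f ms\n", ms);
    return 0;
}
```

This measures the wall time from the first launch to the last completion, so overlapping kernels are counted once, and any copies issued between the two records would be included as well.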

Interestingly, Intel has a similar history with RDTSC, where they introduced superscalar execution and timestamp recording in the same product. (For NVIDIA, it was CUDA 1.1; for Intel, it was the Pentium.) And similarly, Intel had to revise their guidance to developers who relied on RDTSC being a serializing instruction, telling them to serialize explicitly to get meaningful timing results.

Why isn't RDTSC a serializing instruction?
