
I really need to understand how CUDA works in order to measure execution time.

Let's focus on this part (the whole code is at the end of the message if you want to test it).

// Launching Kernel and measuring its time
    cudaEventRecord(startComputation);
    MatProd << <BlockPerGrid, ThreadPerBlock >> >(C, A, B, dimAx, dimBx, dimCx, dimCy);
    cudaEventRecord(stopComputation);

    //cudaEventSynchronize(stopComputation); // this line must be HERE, and then I get a good computation time.

    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);

    cudaDeviceSynchronize(); // putting this doesn't do the job
    //cudaEventSynchronize(stopComputation); // if I put it here instead it doesn't work.

    std::cout << "Computation time : " << millisecondsPureComputation << "ms" << std::endl;

Here is how I understand things. When the program runs, the CPU launches the kernel. The cudaEventRecord, MatProd and cudaEventElapsedTime are all executed on the GPU.

The time between my two cudaEventRecord calls is calculated by cudaEventElapsedTime.

The problem is this: if my CPU is too fast relative to the GPU computation, the variable millisecondsPureComputation will keep its initial value: 0.

Thus I have to tell the CPU "wait until the GPU has finished the cudaEventElapsedTime" before displaying the result. That way the variable millisecondsPureComputation will have the value we want.

Thus, putting a cudaDeviceSynchronize(); right after the cudaEventElapsedTime should be enough.

But in practice it doesn't work when I do this: the variable is still 0. The only way to get a non-zero number is to put cudaEventSynchronize(stopComputation) before the cudaEventElapsedTime, and I don't understand why.

My questions :

Why doesn't my method of putting a cudaDeviceSynchronize(); after the call work? Can you explain why putting a cudaEventSynchronize(stopComputation); before the cudaEventElapsedTime works? What does it do exactly?

#include <iostream>
#include <math.h>
#include <chrono>

__global__  void MatProd(float* C, float* A, float*B, int dimAx, int dimBx, int dimCx, int dimCy)
{
    int row = blockDim.y*blockIdx.y + threadIdx.y;
    int col = blockDim.x*blockIdx.x + threadIdx.x;

    double Result = 0;

    if (row <= dimCy - 1 && col <= dimCx - 1)
    {
        for (int k = 0; k < dimAx; k++)
        {
            Result += A[k + dimAx*row] * B[col + dimBx*k];
        }

        C[col + row*dimCx] = Result;
    }
}

int main(void)
{
    /* Initializing the inputs */
    // Matrix sizes
    int dimAx = 100;
    int dimAy = 100;
    int dimBx = 2;

    int dimBy = dimAx;
    int dimCx = dimBx;
    int dimCy = dimAy;

    // Matrix pointers
    float *A, *B, *C;

    // Variable to measure CUDA time execution.
    float millisecondsPureComputation = 0;
    cudaEvent_t startComputation, stopComputation;
    cudaEventCreate(&startComputation);
    cudaEventCreate(&stopComputation);

    // Memory allocation
    cudaMallocManaged(&A, dimAx*dimAy*sizeof(float));
    cudaMallocManaged(&B, dimBx*dimBy*sizeof(float));
    cudaMallocManaged(&C, dimCx*dimCy*sizeof(float));

    // Initializing matrices
    for (int i = 0; i < dimAy; i++)
    {
        for (int j = 0; j < dimAx; j++) 
        {
            A[j + dimAx*i] = j + 10 * i;
        }
    }
    for (int i = 0; i < dimBy; i++)
    {
        for (int j = 0; j < dimBx; j++)
        {
            B[j + dimBx*i] = (j + 1)*pow(i, 2);
        }
    }

    // Kernel properties

    int threadPerBlockx = 32;
    int threadPerBlocky = 32;
    int BlockPerGridx = 1 + (dimCx - 1) / threadPerBlockx;
    int BlockPerGridy = 1 + (dimCy - 1) / threadPerBlocky;


    dim3 BlockPerGrid(BlockPerGridx, BlockPerGridy, 1);
    dim3 ThreadPerBlock(threadPerBlockx, threadPerBlocky, 1);

    // Launching Kernel and measuring its time
    cudaEventRecord(startComputation);
    MatProd << <BlockPerGrid, ThreadPerBlock >> >(C, A, B, dimAx, dimBx, dimCx, dimCy);
    cudaEventRecord(stopComputation);

    //cudaEventSynchronize(stopComputation); // this line must be HERE, and then I get a good computation time.

    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);

    cudaDeviceSynchronize(); // putting this doesn't do the job
    //cudaEventSynchronize(stopComputation); // if I put it here instead it doesn't work.

    std::cout << "Computation time : " << millisecondsPureComputation << "ms" << std::endl;

    cudaFree(A);
    cudaFree(B);
    cudaFree(C);

    return 0;
}

[edit] I changed the code to the following and now it works, but I still don't understand what is happening...

cudaEventRecord(startComputation);
MatProd << <BlockPerGrid, ThreadPerBlock >> >(C, A, B, dimAx, dimBx, dimCx, dimCy);

//cudaDeviceSynchronize();
cudaEventRecord(stopComputation);
cudaDeviceSynchronize();
cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);

Here are my questions:

  • So, cudaEventRecord() and cudaEventElapsedTime() are executed on the host in my case, if I understood correctly (because of the __host__ in the documentation).

The documentation says that cudaEventRecord captures in the event the content of the stream. It is not totally clear to me what they mean by the "content" of the stream.

But then I don't get how it can work. Indeed, if MatProd takes a long time, the CPU will reach the second cudaEventRecord before the GPU has finished its work. So I should get a wrong result...?

I say this because you explained that these API functions are executed on the host. So, from what I understood, they will be launched in parallel with the kernel. And since we synchronise only after the two cudaEventRecord() calls, I should get a wrong result...?

Maybe it is because I didn't really understand what you mean by executing on the host, but I understand it as a function that is launched on the CPU (and thus doesn't need to wait for the kernel to be completed).

StarBucK

1 Answer


The only thing that executes on the device is code preceded by __global__ or __device__. Everything else, including CUDA runtime API calls and the actual kernel launch itself, is host code.
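
To make that concrete, here is a minimal sketch (the trivial emptyKernel is made up for illustration) marking which side executes each line:

    __global__ void emptyKernel() { }   // device code: only this body runs on the GPU

    int main()
    {
        // Everything below is host code, executed by the CPU thread.
        emptyKernel<<<1, 1>>>();         // the launch itself is host code; it returns
                                         // immediately, without waiting for the GPU
        cudaDeviceSynchronize();         // host code: blocks the CPU until the GPU is idle
        return 0;
    }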

You are getting zero because the (second) event has not occurred yet.

Please read the documentation for cudaEventElapsedTime:

If cudaEventRecord() has been called on both events but one or both of them has not yet been completed (that is, cudaEventQuery() would return cudaErrorNotReady on at least one of the events), cudaErrorNotReady is returned.

This is what is happening in your case, and since you are not doing proper CUDA error checking you are blind to it. When both events have not completed (that means the CUDA stream of execution has not reached both events) then the cudaEventElapsedTime() call performs no operation except to return a CUDA error.
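
A minimal way to make that visible, using the variables from the question, is to check the returned status (a sketch; the surrounding code is unchanged):

    // cudaEventElapsedTime returns a status code; when stopComputation has not
    // completed yet it returns cudaErrorNotReady and leaves the float untouched.
    cudaError_t status = cudaEventElapsedTime(&millisecondsPureComputation,
                                              startComputation, stopComputation);
    if (status != cudaSuccess)
    {
        std::cout << "cudaEventElapsedTime failed: "
                  << cudaGetErrorString(status) << std::endl;
    }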

If you place a cudaDeviceSynchronize() call or an appropriate cudaEventSynchronize() call before the cudaEventElapsedTime() call, this will force the CPU thread to wait at that point, until the event has completed. This will satisfy the necessary condition for the cudaEventElapsedTime() call, and you will get a sensible value for elapsed time.
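
For example, the cudaEventSynchronize() variant of your original code would look like this sketch:

    cudaEventRecord(startComputation);
    MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
    cudaEventRecord(stopComputation);

    cudaEventSynchronize(stopComputation);  // the CPU thread waits here until
                                            // stopComputation has completed

    // Both events are now "COMPLETED", so this call is legal and returns a real value.
    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);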

Adding further description. Let's consider this step-by-step.

  1. In time period 1, the CPU code "records" the startComputation event into the CUDA stream of execution due to this call: cudaEventRecord(startComputation); The CUDA processor (GPU) is idle. Therefore, at this instant, the particular CUDA event startComputation is considered "RECORDED" but not "COMPLETED"
  2. In time period 2, the CPU thread moves forward to the next item after the previous cudaEventRecord call, which is the kernel launch: MatProd << <BlockPerGrid, ThreadPerBlock >> >(...). During this time period, the CPU places the kernel launch as the next item to be processed in the CUDA stream of execution. The CUDA processor (GPU) has work to do, as a result of the activities in time period 1 above, so it begins to process the event. This processing of the event converts the event from a "RECORDED" state to a "COMPLETED" state.
  3. In time period 3, the CPU thread moves forward to the next item after the previous kernel launch, which is another event recording call: cudaEventRecord(stopComputation); Just like in time period 1, this places an event into the CUDA stream of execution, to be processed after the kernel execution is complete. Therefore this new event is in the "RECORDED" state but not the "COMPLETED" state. During this time period 3, the GPU begins to execute the kernel and is busy executing the kernel.
  4. In time period 4, the CPU thread moves forward to the next item after the previous event record call, which is a request to the runtime API to make a measurement between two events (cudaEventElapsedTime). In order to make this measurement, both events must be in the "COMPLETED" state. During this time period 4, the GPU is still busy processing the kernel, so it has not moved forward to process the stopComputation event that was "RECORDED" but not "COMPLETED" in time period 3. Therefore the first of the two events (startComputation) is in the "COMPLETED" state but the second of the two events (stopComputation) is still in the "RECORDED" state. As a result of this, the cuda runtime API call (as already indicated) will return an error and will not give a sensible measurement. It requires that both events be in the "COMPLETED" state before it will return the requested measurement.
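
Mapping those four time periods back onto the failing code as comments:

    cudaEventRecord(startComputation);   // period 1: startComputation "RECORDED"
    MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
                                         // period 2: launch queued; the idle GPU processes
                                         // startComputation, which becomes "COMPLETED"
    cudaEventRecord(stopComputation);    // period 3: stopComputation "RECORDED"; the GPU is
                                         // busy executing the kernel
    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
                                         // period 4: stopComputation still not "COMPLETED",
                                         // so this returns cudaErrorNotReady; no measurement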

So what is different in your modified code that works and includes a synchronization function prior to the elapsed time request? Let's pick up our timeline replay after the end of time period 3 above, since everything up to that point is unchanged. But time period 4 is different now:

  1. In time period 4, the CPU thread moves forward to process the next item after the CUDA event record call, but this instruction is a synchronizing instruction (cudaDeviceSynchronize()). During this time period 4, the GPU is still busy processing the kernel. Since the CUDA timeline/stream still has work to be done, the CPU thread is halted at the synchronization step. It sits there and waits.

  2. In time period 5, the GPU is still busy processing the kernel. The CPU thread is stuck waiting at the cudaDeviceSynchronize() call.

  3. In time period 6, the GPU is still busy processing the kernel. The CPU thread is stuck waiting at the cudaDeviceSynchronize() call.

  4. In time period 7, the GPU finishes processing the kernel, and moves on to the next piece of work recorded in the CUDA stream, which is the cuda event stopComputation. The processing of this event converts the state of stopComputation from "RECORDED" to "COMPLETED". Since the GPU is still doing something during time period 7, the CPU thread is stuck waiting at the cudaDeviceSynchronize() call.

  5. In time period 8, the GPU has finished processing all work issued to it and returns to the idle state. As a result of this, the CPU is no longer required to wait at the cudaDeviceSynchronize() call, so it moves on to the next item in the CPU thread, which is the request for elapsed time measurement. As a result of the prior activity, both events (startComputation and stopComputation) are in the "COMPLETED" state, and so the event elapsed time measurement request is legal, and the call will return a sensible measurement (and no error).
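
Put back into code, the working sequence from your edit, with the time periods as comments:

    cudaEventRecord(startComputation);   // period 1
    MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);  // period 2
    cudaEventRecord(stopComputation);    // period 3
    cudaDeviceSynchronize();             // periods 4-7: the CPU waits here while the GPU
                                         // finishes the kernel, then completes stopComputation
    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
                                         // period 8: both events "COMPLETED"; a sensible
                                         // value is returned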

Robert Crovella
  • Thank you for your answer. I am sorry, but I still don't get it. I edited my message at the end to say what I don't get. All these synchronisation things make me crazy. – StarBucK Dec 17 '17 at 19:24
  • I've spelled it out in considerable detail now. – Robert Crovella Dec 17 '17 at 19:50
  • Wooow, thank you a lot for all your help, sir. So if I have understood correctly: when we record an event, we add a task to the GPU, and this task is simply to pass the event from RECORDED to COMPLETED status. So these functions are, in a sense, a trick to know where the GPU is in its stream. They are like "flag" tasks; RECORDED->COMPLETED is like raising a flag to say "ok, I've done this task". Am I right with this vision? And in practice we use those events to measure execution times. – StarBucK Dec 17 '17 at 20:22
  • Yes, they are like flags. When the GPU gets to that point in the stream, it raises that flag. We measure elapsed time only between two flags that are "raised". If either or both flags are not yet "raised", then the request to measure elapsed time will return an error. – Robert Crovella Dec 17 '17 at 20:34
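
In code terms, the non-blocking "is the flag raised?" check is cudaEventQuery, as in the documentation quoted above (a sketch using the question's event):

    // cudaEventQuery does not wait; it just reports the current state of the "flag".
    if (cudaEventQuery(stopComputation) == cudaSuccess)
    {
        // flag raised: the GPU has processed the event; elapsed time can be measured
    }
    else
    {
        // cudaErrorNotReady: the GPU has not reached this point in the stream yet
    }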