I need to understand how CUDA measures kernel execution time.
Let's focus on this part (the whole code is at the end of the message if you want to test it).
// Launching the kernel and measuring its time
cudaEventRecord(startComputation);
MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
cudaEventRecord(stopComputation);
//cudaEventSynchronize(stopComputation); // this line must be HERE, and then I get a correct computation time.
cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
cudaDeviceSynchronize(); // putting it here doesn't do the job
//cudaEventSynchronize(stopComputation); // if I put it here instead, it doesn't work either.
std::cout << "Computation time : " << millisecondsPureComputation << "ms" << std::endl;
Here is how I understand things. When the program runs, the CPU issues these calls, and cudaEventRecord, MatProd, and cudaEventElapsedTime are all executed on the GPU.
The time between my two cudaEventRecord calls is computed by cudaEventElapsedTime.
The problem is this: if my CPU is too fast compared to the GPU computation, the variable millisecondsPureComputation keeps its initial value: 0.
Thus I have to tell the CPU to "wait until the GPU has finished the cudaEventElapsedTime" before displaying the result. That way the variable millisecondsPureComputation will hold the value we want.
Thus, putting a cudaDeviceSynchronize(); right after the cudaEventElapsedTime should be enough.
But in practice it doesn't work: when I do this, the variable is still 0. The only way to get a non-zero number is to put cudaEventSynchronize(stopComputation) before the cudaEventElapsedTime, and I don't understand why.
My questions:
Why doesn't my method of putting a cudaDeviceSynchronize() work? Can you explain why putting a cudaEventSynchronize(stopComputation) before the cudaEventElapsedTime works? What exactly does it do?
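To be concrete, here is the only ordering that gives me a non-zero time (a minimal sketch of the cudaEventSynchronize variant described above):

cudaEventRecord(startComputation);
MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
cudaEventRecord(stopComputation);
cudaEventSynchronize(stopComputation); // block the CPU until the stop event has completed on the GPU
cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
std::cout << "Computation time : " << millisecondsPureComputation << "ms" << std::endl;

And here is the whole code: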
#include <iostream>
#include <math.h>
#include <chrono>
__global__ void MatProd(float* C, float* A, float* B, int dimAx, int dimBx, int dimCx, int dimCy)
{
    // Each thread computes one element of C.
    int row = blockDim.y*blockIdx.y + threadIdx.y;
    int col = blockDim.x*blockIdx.x + threadIdx.x;
    double Result = 0;
    if (row <= dimCy - 1 && col <= dimCx - 1)
    {
        for (int k = 0; k < dimAx; k++)
        {
            Result += A[k + dimAx*row] * B[col + dimBx*k];
        }
        C[col + row*dimCx] = Result;
    }
}
int main(void)
{
    /* Initializing the inputs */
    // Matrix sizes
    int dimAx = 100;
    int dimAy = 100;
    int dimBx = 2;
    int dimBy = dimAx;
    int dimCx = dimBx;
    int dimCy = dimAy;
    // Matrix pointers
    float *A, *B, *C;
    // Variables to measure CUDA execution time.
    float millisecondsPureComputation = 0;
    cudaEvent_t startComputation, stopComputation;
    cudaEventCreate(&startComputation);
    cudaEventCreate(&stopComputation);
    // Memory allocation
    cudaMallocManaged(&A, dimAx*dimAy*sizeof(float));
    cudaMallocManaged(&B, dimBx*dimBy*sizeof(float));
    cudaMallocManaged(&C, dimCx*dimCy*sizeof(float));
    // Initializing matrices
    for (int i = 0; i < dimAy; i++)
    {
        for (int j = 0; j < dimAx; j++)
        {
            A[j + dimAx*i] = j + 10 * i;
        }
    }
    for (int i = 0; i < dimBy; i++)
    {
        for (int j = 0; j < dimBx; j++)
        {
            B[j + dimBx*i] = (j + 1)*pow(i, 2);
        }
    }
    // Kernel properties
    int threadPerBlockx = 32;
    int threadPerBlocky = 32;
    int BlockPerGridx = 1 + (dimCx - 1) / threadPerBlockx;
    int BlockPerGridy = 1 + (dimCy - 1) / threadPerBlocky;
    dim3 BlockPerGrid(BlockPerGridx, BlockPerGridy, 1);
    dim3 ThreadPerBlock(threadPerBlockx, threadPerBlocky, 1);
    // Launching the kernel and measuring its time
    cudaEventRecord(startComputation);
    MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
    cudaEventRecord(stopComputation);
    //cudaEventSynchronize(stopComputation); // this line must be HERE, and then I get a correct computation time.
    cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
    cudaDeviceSynchronize(); // putting it here doesn't do the job
    //cudaEventSynchronize(stopComputation); // if I put it here instead, it doesn't work either.
    std::cout << "Computation time : " << millisecondsPureComputation << "ms" << std::endl;
    cudaFree(A);
    cudaFree(B);
    cudaFree(C);
    return 0;
}
[edit] I changed the code to this, and now it works, but I still don't understand what is happening.
cudaEventRecord(startComputation);
MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
//cudaDeviceSynchronize();
cudaEventRecord(stopComputation);
cudaDeviceSynchronize();
cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
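While experimenting I also started checking the return values (a sketch; the error checks are my addition, not part of the original program). If I read the documentation correctly, cudaEventElapsedTime returns cudaErrorNotReady when one of the events has not completed yet, and in that case it does not write the output variable, which would explain my 0:

cudaEventRecord(startComputation);
MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
cudaEventRecord(stopComputation);
// Broken ordering: no synchronization before asking for the elapsed time.
cudaError_t err = cudaEventElapsedTime(&millisecondsPureComputation, startComputation, stopComputation);
if (err == cudaErrorNotReady)
    std::cout << "events not completed yet, time not written" << std::endl;
else if (err != cudaSuccess)
    std::cout << "CUDA error: " << cudaGetErrorString(err) << std::endl;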
Here are my questions:
- So, the cudaEventRecord() and cudaEventElapsedTime() calls are executed on the host in my case, if I understood well (because of the __host__ qualifier in the documentation).
The documentation says that cudaEventRecord captures the contents of the stream in the event. It is not totally clear to me what they mean by the "contents" of the stream.
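To probe what "capturing the contents of the stream" means, I tried asking the event for its state with cudaEventQuery (a sketch; my current understanding, which may be wrong, is that the event only completes once all the work enqueued in the stream before it has finished):

MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
cudaEventRecord(stopComputation); // returns immediately on the host
// Right after recording, the event has usually not completed, because the
// kernel enqueued in the stream before it is still running:
if (cudaEventQuery(stopComputation) == cudaErrorNotReady)
    std::cout << "stop event not reached on the GPU yet" << std::endl;
cudaEventSynchronize(stopComputation); // block until the event completes
// Now everything enqueued before the event has finished:
if (cudaEventQuery(stopComputation) == cudaSuccess)
    std::cout << "stop event completed" << std::endl;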
But then I don't get how it can work. Indeed, if MatProd takes a long time, the CPU reaches the second cudaEventRecord before the GPU has finished its work, so I should get a wrong result...?
I say this because you explained that these API functions are executed on the host. So, from what I understood, they are launched in parallel with the kernel. And since we synchronize only after the two cudaEventRecord() calls, I should get a wrong result...?
Maybe it is because I didn't really understand what you mean by "executed on the host", but I understand it as a function launched on the CPU (which thus doesn't need to wait for the kernel to complete).
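To check this last point, I timed the launch call itself on the host with std::chrono (a sketch; exact numbers will of course vary from machine to machine):

// Measuring how long the launch *call* takes on the CPU. Because kernel
// launches are asynchronous, this should be far shorter than the kernel's
// actual execution time.
auto t0 = std::chrono::high_resolution_clock::now();
MatProd<<<BlockPerGrid, ThreadPerBlock>>>(C, A, B, dimAx, dimBx, dimCx, dimCy);
auto t1 = std::chrono::high_resolution_clock::now(); // reached while the GPU may still be working
cudaDeviceSynchronize(); // now wait for the kernel to actually finish
auto t2 = std::chrono::high_resolution_clock::now();
std::cout << "launch call: " << std::chrono::duration<double, std::milli>(t1 - t0).count() << " ms, "
          << "launch + wait: " << std::chrono::duration<double, std::milli>(t2 - t0).count() << " ms" << std::endl;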