CUDA kernel time measurement with cudaEventElapsedTime

Question

I have an NVS 5400M and I'm trying to get reliable timing results for a CUDA matrix addition (1000 x 1000 matrices).

__global__ void MatAdd(int** A, int** B, int** C) {
    int i = threadIdx.x;
    int j = threadIdx.y;
    C[i][j] = A[i][j] + B[i][j];
}

I measure the time like this:

int numBlocks = 1;
dim3 threadsPerBlock(1000, 1000);

float time;
cudaEvent_t start, stop;
cudaEventCreate(&start);
cudaEventCreate(&stop);
cudaEventRecord(start, 0);

MatAdd <<<numBlocks, threadsPerBlock>>>(pA, pB, pC);

cudaEventRecord(stop, 0);
cudaEventSynchronize(stop);
cudaEventElapsedTime(&time, start, stop);

cout << setprecision(10) << "GPU Time [ms] " << time << endl;

The result is 0.001504000043 ms, which seems very small. Am I measuring it correctly?

score 1 · Accepted Answer · edited May 23 '17 at 10:33


Your timing is correct, but your usage of CUDA in general is not.

This is illegal:

dim3 threadsPerBlock(1000, 1000);

CUDA kernels are limited to a maximum of 1024 threads per block, but you are requesting 1000x1000 = 1,000,000 threads per block.

As a result, your kernel is not actually launching:

MatAdd <<<numBlocks, threadsPerBlock>>>(pA, pB, pC);

And so the measured time is quite short.
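A launch configuration that stays within the per-block limit could be sized roughly as follows. This is only a sketch: the helper names `blocksFor` and `gridFor` are hypothetical (they are not part of the CUDA API), and it assumes a 2D grid of square blocks.

```cuda
#include <cuda_runtime.h>

// Hypothetical helper (not a CUDA API): number of blocks needed to cover
// n elements with blockSize threads per block, rounding up.
inline unsigned int blocksFor(unsigned int n, unsigned int blockSize) {
    return (n + blockSize - 1) / blockSize;
}

// Hypothetical helper: a 2D grid that covers an n x n matrix
// for the given block shape.
inline dim3 gridFor(unsigned int n, dim3 block) {
    return dim3(blocksFor(n, block.x), blocksFor(n, block.y));
}
```

With `dim3 block(32, 32)` (1024 threads, the maximum mentioned above) and `n = 1000`, `gridFor` yields a 32 x 32 grid. The grid then covers 1024 x 1024 positions, so the kernel needs a bounds check to skip the threads that fall outside the matrix.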

You should use proper CUDA error checking and run your tests with cuda-memcheck to make sure no runtime errors are reported. My guess is that you are currently not even aware of the errors your code is reporting; you have to check for them.
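One common way to check for such errors is to query the runtime right after the launch and again after synchronizing. The sketch below assumes a trivial placeholder kernel; `emptyKernel` and `tryLaunch` are hypothetical names used only for illustration.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Placeholder kernel, used only to demonstrate launch error checking.
__global__ void emptyKernel() {}

// Hypothetical helper: launches one block of the given shape and returns
// the first error reported by the CUDA runtime, or cudaSuccess.
cudaError_t tryLaunch(dim3 block) {
    emptyKernel<<<1, block>>>();

    // Launch-configuration errors (such as too many threads per block)
    // are reported here, immediately after the launch.
    cudaError_t err = cudaGetLastError();
    if (err != cudaSuccess) {
        fprintf(stderr, "launch failed: %s\n", cudaGetErrorString(err));
        return err;
    }

    // Errors raised while the kernel executes (such as illegal memory
    // accesses) surface once the host waits for the device.
    err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "kernel failed: %s\n", cudaGetErrorString(err));
    }
    return err;
}
```

Calling `tryLaunch(dim3(1000, 1000))` on a device with a 1024-thread block limit should report an invalid configuration error instead of failing silently, whereas `tryLaunch(dim3(32, 32))` stays within the limit.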

Since you haven't shown complete code, I won't try to identify every other issue that may be present. However, your kernel would have to be refactored to handle a 1000x1000 array properly, and passing double-pointer parameters (e.g. int** A) to kernels is considerably more difficult than passing a single pointer to a "flat" array.
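To illustrate what such a refactoring might look like, here is a minimal sketch that assumes the matrices are stored as flat, row-major arrays. The names `matAddFlat` and `addOnGpu` and the 16 x 16 block shape are assumptions, not taken from the question, and error checking is left out to keep the sketch short.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Each thread adds one element of an n x n row-major matrix.
__global__ void matAddFlat(const int* a, const int* b, int* c, int n) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;
    if (row < n && col < n) {  // skip threads outside the matrix
        int idx = row * n + col;
        c[idx] = a[idx] + b[idx];
    }
}

// Hypothetical host wrapper: returns a + b and stores the kernel time
// in milliseconds in *elapsedMs. Error checking omitted for brevity.
std::vector<int> addOnGpu(const std::vector<int>& a,
                          const std::vector<int>& b,
                          int n, float* elapsedMs) {
    const size_t count = static_cast<size_t>(n) * n;
    const size_t bytes = count * sizeof(int);
    int *dA = nullptr, *dB = nullptr, *dC = nullptr;
    cudaMalloc(&dA, bytes);
    cudaMalloc(&dB, bytes);
    cudaMalloc(&dC, bytes);
    cudaMemcpy(dA, a.data(), bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(dB, b.data(), bytes, cudaMemcpyHostToDevice);

    dim3 block(16, 16);  // 256 threads per block, within the limit
    dim3 grid((n + block.x - 1) / block.x, (n + block.y - 1) / block.y);

    // Same event-based timing as in the question.
    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);
    cudaEventRecord(start, 0);
    matAddFlat<<<grid, block>>>(dA, dB, dC, n);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);
    cudaEventElapsedTime(elapsedMs, start, stop);

    std::vector<int> c(count);
    cudaMemcpy(c.data(), dC, bytes, cudaMemcpyDeviceToHost);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(dA);
    cudaFree(dB);
    cudaFree(dC);
    return c;
}
```

Because each matrix is a single allocation, one `cudaMalloc` and one `cudaMemcpy` per matrix are enough, and the kernel derives the element index from the block and thread indices rather than from `threadIdx` alone.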


answered May 09 '16 at 16:24

Robert Crovella


After checking cudaError_t I received the message: sync kernel error: invalid configuration argument. Thank you! – azet52 May 09 '16 at 16:35
@Edit: I've changed to `dim3 threadsPerBlock(32, 32)` and the actual code is: `int** A = create_random_matrix(1000);` `int** pA = initialize_matrix(1000);` `cudaMalloc((void**)pA, (1000*1000)*sizeof(int));` `cudaMemcpy(pA, A, (1000*1000)*sizeof(int), cudaMemcpyHostToDevice)` the same for B & C, and after that `MatAdd <<<numBlocks, threadsPerBlock>>>(pA, pB, pC);` `cudaMemcpy(C, pC, (1000*1000)*sizeof(int), cudaMemcpyDeviceToHost);` and received: invalid argument & an illegal memory access was encountered – azet52 May 09 '16 at 17:03
The illegal memory access occurs because you don't understand how to manage double-pointer arguments to kernels correctly. It will require a deep copy. For simplicity, I would suggest flattening your arrays so they can be referenced with a single subscript. If you want help with code that you haven't shown and a new problem, I suggest posting a new question. – Robert Crovella May 09 '16 at 17:05
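The flattening suggested in the last comment might look like the following sketch on the host side. The helpers `flatten` and `uploadFlat` are hypothetical names, and the code assumes the host matrix is stored as n separately allocated rows of n ints.

```cuda
#include <cstddef>
#include <vector>
#include <cuda_runtime.h>

// Hypothetical helper: copy an n x n matrix stored as n separate rows
// into one contiguous row-major buffer.
std::vector<int> flatten(int* const* rows, int n) {
    std::vector<int> flat(static_cast<size_t>(n) * n);
    for (int r = 0; r < n; ++r) {
        for (int c = 0; c < n; ++c) {
            flat[static_cast<size_t>(r) * n + c] = rows[r][c];
        }
    }
    return flat;
}

// Hypothetical helper: allocate device memory for the flat buffer and
// upload it with a single cudaMemcpy. Returns nullptr on failure.
int* uploadFlat(const std::vector<int>& flat) {
    int* dPtr = nullptr;
    const size_t bytes = flat.size() * sizeof(int);
    // cudaMalloc takes the address of the pointer it should fill in.
    if (cudaMalloc(&dPtr, bytes) != cudaSuccess) {
        return nullptr;
    }
    if (cudaMemcpy(dPtr, flat.data(), bytes,
                   cudaMemcpyHostToDevice) != cudaSuccess) {
        cudaFree(dPtr);
        return nullptr;
    }
    return dPtr;
}
```

On the device, element (row, col) is then read as `data[row * n + col]`, so the kernel takes an `int*` and no deep copy of row pointers is needed.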
