Any time you are having trouble with a CUDA code, it's advisable to run your code with cuda-memcheck
and also to add proper CUDA error checking.
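A minimal error-checking macro along these lines can be dropped after each CUDA API call and kernel launch (a sketch; the macro name `cudaCheckErrors` is my own choice, any equivalent wrapper will do):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Check the CUDA runtime's error status; intended to be invoked
// after API calls and kernel launches.
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)
```

Usage would look like `cudaMalloc(&d_A, dsize); cudaCheckErrors("cudaMalloc fail");`. Note that kernel launches are asynchronous, so a `cudaDeviceSynchronize()` before the check is needed to also catch errors that occur during kernel execution.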
This:
dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
is not legal in CUDA for BLOCK_DIM larger than 22 or 32 (depending on GPU and CUDA version). CUDA kernels are limited to 512 or 1024 threads per block, and that limit applies to the product of the individual threadblock dimensions. So in your case BLOCK_DIM*BLOCK_DIM must not exceed 512 or 1024, depending on GPU and CUDA version (which is where the limit of 22 or 32 on BLOCK_DIM itself comes from). Setting BLOCK_DIM to 512 cannot work, in any case.
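If you want to check the actual limit for your device at runtime, you can query the device properties (a sketch; device 0 is assumed):

```cuda
#include <stdio.h>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("max block dims: %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  return 0;
}
```

A square 2D block then has to satisfy BLOCK_DIM*BLOCK_DIM <= maxThreadsPerBlock.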
If you are creating a variable like this on the stack:
float matrix1[N][N];
that will cause trouble as N gets larger, because you may run into limits on stack size. (This is not related to CUDA.) Instead create these variables dynamically (an example is given in the code below).
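For example, a heap allocation preserves the convenient 2D indexing when one dimension is known at compile time (a minimal sketch of the same typedef approach used in the code below):

```cuda
#include <stdlib.h>

const size_t N = 2048;
typedef float my_row[N];                 // one row of N floats

int main(){
  // float matrix1[N][N];                // stack allocation: may overflow for large N
  my_row *matrix1 = (my_row *)malloc(N*N*sizeof(float));  // heap allocation
  matrix1[5][7] = 1.0f;                  // 2D indexing still works
  free(matrix1);
  return 0;
}
```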
The following code (built around the pieces you have shown) seems to work for me, and implements the above changes. I've omitted proper CUDA error checking for brevity of presentation, but I recommend using it any time you are having trouble with a CUDA code. In lieu of that, I am running it with cuda-memcheck:
$ cat t1002.cu
#include <stdio.h>
#include <math.h>

const size_t BLOCK_DIM = 16;
const size_t MY_N = 2048;
const float tval1 = 1.0f;
const float tval2 = 2.0f;

__global__ void parMatrixAdd_kernel(float *a, float *b, float *c, int N) {
  int col = threadIdx.x + blockIdx.x * blockDim.x;
  int row = threadIdx.y + blockIdx.y * blockDim.y;
  int index = col + row * N;
  if (col < N && row < N) {
    c[index] = a[index] + b[index];
  }
}

typedef float my_mat[MY_N];

int main(){

  my_mat *A, *B, *C;
  const size_t N = MY_N;
  size_t dsize = N*N*sizeof(float);
  A = (my_mat *)malloc(dsize);
  B = (my_mat *)malloc(dsize);
  C = (my_mat *)malloc(dsize);

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = tval1;
      B[i][j] = tval2;
      C[i][j] = 0.0f;}

  float *d_A, *d_B, *d_C;
  cudaMalloc(&d_A, dsize);
  cudaMalloc(&d_B, dsize);
  cudaMalloc(&d_C, dsize);

  cudaMemcpy(d_A, A, dsize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B, dsize, cudaMemcpyHostToDevice);

  dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
  dim3 dimGrid((int)ceil((double)N/dimBlock.x),(int)ceil((double)N/dimBlock.y));

  parMatrixAdd_kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);

  cudaMemcpy(C, d_C, dsize, cudaMemcpyDeviceToHost);

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (C[i][j] != (tval1+tval2)) {printf("mismatch at %d,%d was: %f, should be %f\n", i, j, C[i][j], (tval1+tval2)); return 1;}

  printf("Success!\n");
  return 0;
}
$ nvcc -o t1002 t1002.cu
$ cuda-memcheck ./t1002
========= CUDA-MEMCHECK
Success!
========= ERROR SUMMARY: 0 errors
$
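As a side note, the ceil-based grid computation above works, but a purely integer idiom that avoids the float/double round trip is also common (a sketch, using the same N and dimBlock as above):

```cuda
// round N/dimBlock.x and N/dimBlock.y up to whole blocks,
// using integer arithmetic only
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
             (N + dimBlock.y - 1) / dimBlock.y);
```

For N = 2048 and BLOCK_DIM = 16 both forms give the same 128 x 128 grid; the threadcheck (`if (col < N && row < N)`) in the kernel handles any excess threads when N is not a multiple of the block dimension.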