Any time you are having trouble with a CUDA code, it's advisable to run your code with cuda-memcheck
and also to add proper CUDA error checking.
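A minimal error-checking macro along these lines can be dropped after each CUDA API call and kernel launch (a sketch; the macro name `cudaCheckErrors` is my own choice, any equivalent wrapper will do):

```cuda
#include <stdio.h>
#include <stdlib.h>

// Check the CUDA runtime's error status; intended to be invoked
// after API calls and kernel launches.
#define cudaCheckErrors(msg) \
  do { \
    cudaError_t __err = cudaGetLastError(); \
    if (__err != cudaSuccess) { \
      fprintf(stderr, "Fatal error: %s (%s at %s:%d)\n", \
              msg, cudaGetErrorString(__err), __FILE__, __LINE__); \
      exit(1); \
    } \
  } while (0)
```

Usage would look like `cudaMalloc(&d_A, dsize); cudaCheckErrors("cudaMalloc fail");`. Note that kernel launches are asynchronous, so a `cudaDeviceSynchronize()` before the check is needed to also catch errors that occur during kernel execution.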
This:
dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
is not legal in CUDA for BLOCK_DIM larger than 22 or 32 (depending on GPU and CUDA version). CUDA kernels are limited to 512 or 1024 threads per block, and that limit applies to the product of the individual threadblock dimensions. So in your case BLOCK_DIM*BLOCK_DIM must not exceed 512 or 1024, depending on GPU and CUDA version (which is where the limit of 22 or 32 on BLOCK_DIM itself comes from). Setting BLOCK_DIM to 512 cannot work, in any case.
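If you want to check the actual limit for your device at runtime, you can query the device properties (a sketch; device 0 is assumed):

```cuda
#include <stdio.h>

int main(){
  cudaDeviceProp prop;
  cudaGetDeviceProperties(&prop, 0);   // query device 0
  printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
  printf("max block dims: %d x %d x %d\n",
         prop.maxThreadsDim[0], prop.maxThreadsDim[1], prop.maxThreadsDim[2]);
  return 0;
}
```

A square 2D block then has to satisfy BLOCK_DIM*BLOCK_DIM <= maxThreadsPerBlock.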
If you are creating a variable like this on the stack:
float matrix1[N][N];
that will cause trouble as N gets larger, because you may run into limits on stack size. (This is not related to CUDA.) Instead create these variables dynamically (an example is given in the code below).
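For example, a heap allocation preserves the convenient 2D indexing when one dimension is known at compile time (a minimal sketch of the same typedef approach used in the code below):

```cuda
#include <stdlib.h>

const size_t N = 2048;
typedef float my_row[N];                 // one row of N floats

int main(){
  // float matrix1[N][N];                // stack allocation: may overflow for large N
  my_row *matrix1 = (my_row *)malloc(N*N*sizeof(float));  // heap allocation
  matrix1[5][7] = 1.0f;                  // 2D indexing still works
  free(matrix1);
  return 0;
}
```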
The following code (built around the pieces you have shown) seems to work for me, and implements the above changes. I've omitted proper CUDA error checking for brevity of presentation, but I recommend using it any time you are having trouble with a CUDA code. In lieu of that, I am running it with cuda-memcheck:
$ cat t1002.cu
#include <stdio.h>
#include <math.h>

const size_t BLOCK_DIM = 16;
const size_t MY_N = 2048;
const float tval1 = 1.0f;
const float tval2 = 2.0f;

__global__ void parMatrixAdd_kernel(float *a, float *b, float *c, int N) {
  int col = threadIdx.x + blockIdx.x * blockDim.x;
  int row = threadIdx.y + blockIdx.y * blockDim.y;
  int index = col + row * N;
  if (col < N && row < N) {
    c[index] = a[index] + b[index];
  }
}

typedef float my_mat[MY_N];

int main(){

  my_mat *A, *B, *C;
  const size_t N = MY_N;
  size_t dsize = N*N*sizeof(float);
  A = (my_mat *)malloc(dsize);
  B = (my_mat *)malloc(dsize);
  C = (my_mat *)malloc(dsize);

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++) {
      A[i][j] = tval1;
      B[i][j] = tval2;
      C[i][j] = 0.0f;}

  float *d_A, *d_B, *d_C;
  cudaMalloc(&d_A, dsize);
  cudaMalloc(&d_B, dsize);
  cudaMalloc(&d_C, dsize);

  cudaMemcpy(d_A, A, dsize, cudaMemcpyHostToDevice);
  cudaMemcpy(d_B, B, dsize, cudaMemcpyHostToDevice);

  dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
  dim3 dimGrid((int)ceil((double)N/dimBlock.x),(int)ceil((double)N/dimBlock.y));

  parMatrixAdd_kernel<<<dimGrid, dimBlock>>>(d_A, d_B, d_C, N);

  cudaMemcpy(C, d_C, dsize, cudaMemcpyDeviceToHost);

  for (int i = 0; i < N; i++)
    for (int j = 0; j < N; j++)
      if (C[i][j] != (tval1+tval2)) {printf("mismatch at %d,%d was: %f, should be %f\n", i, j, C[i][j], (tval1+tval2)); return 1;}

  printf("Success!\n");
  return 0;
}
$ nvcc -o t1002 t1002.cu
$ cuda-memcheck ./t1002
========= CUDA-MEMCHECK
Success!
========= ERROR SUMMARY: 0 errors
$
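As a side note, the ceil-based grid computation above works, but a purely integer idiom that avoids the float/double round trip is also common (a sketch, using the same N and dimBlock as above):

```cuda
// round N/dimBlock.x and N/dimBlock.y up to whole blocks,
// using integer arithmetic only
dim3 dimGrid((N + dimBlock.x - 1) / dimBlock.x,
             (N + dimBlock.y - 1) / dimBlock.y);
```

For N = 2048 and BLOCK_DIM = 16 both forms give the same 128 x 128 grid; the threadcheck (`if (col < N && row < N)`) in the kernel handles any excess threads when N is not a multiple of the block dimension.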