Adding two matrices in CUDA using two-dimensional threads

Question

I am new to CUDA. I have added two vectors and it works fine. Now I want to add two matrices using two-dimensional thread indexing (threadIdx.x and threadIdx.y). I found this code on the Internet and changed it to display the results. It compiles, but it displays unexpected results that look like memory addresses. Please help me. Thank you in advance.

#include <stdio.h>
#include <stdlib.h>

#define N 5
#define BLOCK_DIM 10

__global__ void matrixAdd (int *a, int *b, int *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    int index = col + row * N;

    if (col < N && row < N) {
        c[index] = a[index] + b[index];
    }

}

int main() {
    int a[N][N], b[N][N], c[N][N];
    int *dev_a, *dev_b, *dev_c;

    int size = N * N;

    for(int i=0; i<N; i++)
        for (int j=0; j<N; j++){
            a[i][j] = 1;
            b[i][j] = 2;
        }

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    dim3 dimGrid((int)ceil(N/dimBlock.x),(int)ceil(N/dimBlock.y));

    matrixAdd<<<dimGrid,dimBlock>>>(dev_a,dev_b,dev_c);
    cudaDeviceSynchronize();

    for(int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            printf("%d\t", c[i][j] );
        }
        printf("\n");
    }

    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    cudaFree(dev_a); 
    cudaFree(dev_b); 
    cudaFree(dev_c);
}

The output is:

0   0   -780197879  32659   1   
0   452489360   32764   6303208 0   
4198328 0   452489376   32764   4198181 
0   2   0   4198557 0   
4196864 0   0   0   4198480

My expected output is a 5x5 matrix in which every element is 3. Please help me.

score 4 · Accepted Answer · edited May 23 '17 at 11:44

You should always use proper CUDA error checking and run your code with cuda-memcheck any time you are having trouble with CUDA code. I recommend doing that before you ask for help here. Even if you don't understand the error output, it will be useful to those trying to help you.
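As a rough illustration of what such error checking might look like, here is a minimal sketch. `CUDA_CHECK` and the `noop` kernel are names made up for this example, not part of the CUDA runtime; `cudaGetLastError`, `cudaGetErrorString` and `cudaDeviceSynchronize` are runtime API calls. The launch deliberately uses a zero-sized grid so the check has something to report.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// CUDA_CHECK is a hypothetical helper name chosen for this sketch.
// It prints the runtime's error string and exits if a call did not succeed.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Empty kernel, used only to have something to launch.
__global__ void noop() {}

int main() {
    dim3 block(10, 10);
    dim3 grid(0, 0);  // deliberately invalid: grid dimensions of zero

    noop<<<grid, block>>>();
    CUDA_CHECK(cudaGetLastError());       // catches a bad launch configuration
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs

    printf("kernel launched and completed\n");
    return 0;
}
```

With the zero-sized grid, the first check after the launch should fail and the final printf should never be reached. Wrapping cudaMalloc and cudaMemcpy in the same macro would cover the rest of a program like yours.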

If you had done proper CUDA error checking, you would have been notified of an "invalid argument" error on the kernel launch. It arises from this calculation:
```
dim3 dimGrid((int)ceil(N/dimBlock.x),(int)ceil(N/dimBlock.y));
```
If you print out the actual computed values after that calculation:
```
printf("dimGrid.x = %d, dimGrid.y = %d\n", dimGrid.x, dimGrid.y);
```
You will find that they are both zero. A grid dimension of zero is illegal.

This construct:
```
N/dimBlock.x
```
uses integer division. With N = 5 and dimBlock.x = 10, the integer quotient is zero. The ceil call does not help the way you have written it, because it is applied only after the division has already truncated to zero.
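To make the order of operations concrete, here is a small host-side sketch (plain C++, which also compiles inside a .cu file). The two function names are made up for this illustration. The first mirrors the original expression; the second converts one operand to double before dividing, which is one way to make ceil see the fraction.

```cpp
#include <cassert>
#include <cmath>

// Mirrors the original expression: the operands are integers, so the
// division truncates before ceil ever sees the value.
unsigned blocks_int_division(unsigned n, unsigned block_dim) {
    assert(block_dim > 0);  // avoid dividing by zero
    return (unsigned)std::ceil(n / block_dim);
}

// Converting one operand to double first keeps the fractional part,
// so ceil can round it up.
unsigned blocks_float_division(unsigned n, unsigned block_dim) {
    assert(block_dim > 0);  // avoid dividing by zero
    return (unsigned)std::ceil((double)n / block_dim);
}
```

For n = 5 and block_dim = 10, the first function returns 0 and the second returns 1.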

There are many ways to fix it. One possible approach is to do the arithmetic like this:
```
dim3 dimGrid((N+dimBlock.x-1)/dimBlock.x, (N+dimBlock.y-1)/dimBlock.y);
```
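The same round-up arithmetic can be wrapped in a small helper, sketched here as host-side C++. `div_up` is a hypothetical name chosen for this example, not a CUDA function.

```cpp
#include <cassert>

// div_up (hypothetical name): the smallest number of blocks of size
// block_dim needed to cover n elements, using integer arithmetic only.
unsigned div_up(unsigned n, unsigned block_dim) {
    assert(block_dim > 0);  // a block dimension of zero makes no sense
    return (n + block_dim - 1) / block_dim;
}
```

With a block dimension of 10, div_up gives 1 for n = 5 and n = 10, 2 for n = 11, and 3 for n = 25, so the grid always has enough threads. That is why the kernel needs the `col < N && row < N` guard: the extra threads in the last block must not touch memory.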
The next error is in the calculation of your size variable:
```
int size = N * N;
```
cudaMalloc and cudaMemcpy, like malloc and memcpy, expect a size in bytes. So you should do this:
```
int size = N * N * sizeof(int);
```
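To see how far off the original size was, here is a small host-side sketch in C++. Both function names are made up for this illustration, and the concrete numbers below assume a 4-byte int, which is the usual case but still an assumption.

```cpp
#include <cassert>
#include <cstddef>

// Bytes needed for an n x n matrix of int, which is the unit that
// cudaMalloc and cudaMemcpy expect.
std::size_t matrix_bytes(std::size_t n) {
    assert(n > 0);  // an empty matrix is not meaningful here
    return n * n * sizeof(int);
}

// How many whole int elements fit into a buffer of the given byte count.
std::size_t whole_ints_in(std::size_t bytes) {
    return bytes / sizeof(int);
}
```

With a 4-byte int, matrix_bytes(5) is 100, while the original size of 25 bytes holds only 6 whole elements out of 25. So even with a valid launch, most of the matrix would never have been allocated or copied.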

Finally, you want to copy the data back to the host before you print it out. So this is not correct:

```
for(int i=0; i<N; i++){
    for (int j=0; j<N; j++){
        printf("%d\t", c[i][j] );
    }
    printf("\n");
}

cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
```

You want to do this:

```
cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);
for(int i=0; i<N; i++){
    for (int j=0; j<N; j++){
        printf("%d\t", c[i][j] );
    }
    printf("\n");
}
```

Here is a worked example showing the above modifications to your code:

$ cat t1058.cu
#include <stdio.h>
#include <stdlib.h>

#define N 5
#define BLOCK_DIM 10

__global__ void matrixAdd (int *a, int *b, int *c) {
    int col = blockIdx.x * blockDim.x + threadIdx.x;
    int row = blockIdx.y * blockDim.y + threadIdx.y;

    int index = col + row * N;

    if (col < N && row < N) {
        c[index] = a[index] + b[index];
    }

}

int main() {
    int a[N][N], b[N][N], c[N][N];
    int *dev_a, *dev_b, *dev_c;

    int size = N * N * sizeof(int);

    for(int i=0; i<N; i++)
        for (int j=0; j<N; j++){
            a[i][j] = 1;
            b[i][j] = 2;
        }

    cudaMalloc((void**)&dev_a, size);
    cudaMalloc((void**)&dev_b, size);
    cudaMalloc((void**)&dev_c, size);

    cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
    cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

    dim3 dimBlock(BLOCK_DIM, BLOCK_DIM);
    //dim3 dimGrid((int)ceil(N/dimBlock.x),(int)ceil(N/dimBlock.y));
    dim3 dimGrid((N+dimBlock.x-1)/dimBlock.x, (N+dimBlock.y-1)/dimBlock.y);
    printf("dimGrid.x = %d, dimGrid.y = %d\n", dimGrid.x, dimGrid.y);
    matrixAdd<<<dimGrid,dimBlock>>>(dev_a,dev_b,dev_c);
    cudaDeviceSynchronize();
    cudaMemcpy(c, dev_c, size, cudaMemcpyDeviceToHost);

    for(int i=0; i<N; i++){
        for (int j=0; j<N; j++){
            printf("%d\t", c[i][j] );
        }
        printf("\n");
    }


    cudaFree(dev_a);
    cudaFree(dev_b);
    cudaFree(dev_c);
}
$ nvcc -o t1058 t1058.cu
$ cuda-memcheck ./t1058
========= CUDA-MEMCHECK
dimGrid.x = 1, dimGrid.y = 1
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
3 3 3 3 3
========= ERROR SUMMARY: 0 errors
$

Thank you very much. Since I am new, I was struggling with this over a month. Thank you for helping. Keep up the good work. We (beginners) need people like you , thanks again — Saahithyan Vigneswaran, Jan 20 '16 at 06:03
