
I am writing a simple piece of code that adds the elements of two matrices A and B; the code is quite simple and is inspired by the example given in chapter 2 of the CUDA C Programming Guide.

#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
    int i = threadIdx.x;
    int j = threadIdx.y;

    C[i][j] = A[i][j] + B[i][j];
}


int main(){

int A[N][N] = {{1,2},{3,4}};
int B[N][N] = {{5,6},{7,8}};
int C[N][N] = {{0,0},{0,0}};    

int (*pA)[N], (*pB)[N], (*pC)[N];

cudaMalloc((void**)&pA, (N*N)*sizeof(int));
cudaMalloc((void**)&pB, (N*N)*sizeof(int));
cudaMalloc((void**)&pC, (N*N)*sizeof(int));

cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(pB, B, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(pC, C, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

int numBlocks = 1;
dim3 threadsPerBlock(N,N);
MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C);

cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);

int i, j; printf("C = \n");
for(i=0;i<N;i++){
    for(j=0;j<N;j++){
        printf("%d ", C[i][j]);
    }
    printf("\n");
}

cudaFree(pA); 
cudaFree(pB); 
cudaFree(pC);

printf("\n");

return 0;
}

When I run it I keep getting the initial matrix C = [0 0 ; 0 0] instead of the element-wise sum of the two matrices A and B. I previously did another example about adding the elements of two arrays and it worked fine; however, this time I don't know why it does not work.

I believe there's something wrong with the cudaMalloc command, but I don't really know what else it could be.

Any ideas?

  • Start by adding [proper cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) to your code. Your method of creating 2D matrices on the device won't work as-is. Because of the difficulty associated with creating 2D matrices on the device, it's frequently suggested that you avoid it and flatten your matrices to 1D, and use index/pointer arithmetic to simulate 2D access. (Your pointer allocations for `pA`, etc. are basically 1D at the moment anyway; a sketch of this approach follows these comments.) – Robert Crovella Nov 03 '14 at 16:09
  • Your comment helped a lot, Mr. @JackOLantern – Federico Gentile Nov 03 '14 at 16:27
  • Could you try `MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC);`? – francis Nov 03 '14 at 17:10
  • @francis What you just wrote seems to be the correct answer! However, I still don't understand why the values contained in A, B and C aren't mapped to the MatAdd function... – Federico Gentile Nov 03 '14 at 17:25
  • Since matrix addition is just position by position, can't you just deal with a 1D array? – Grady Player Nov 03 '14 at 17:58
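Following up on Robert Crovella's comment, here is a minimal sketch of the flattened-1D approach with basic error checking. This is not from the original post: the `checkCuda` macro and the `MatAddFlat` kernel name are illustrative choices, and any error-checking helper of this shape would do.

```c
#include <stdio.h>
#include <stdlib.h>

#define N 2

// Hypothetical helper (not from the original post): abort on any CUDA error.
#define checkCuda(call)                                               \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error %s at %s:%d\n",               \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Flattened kernel: element (i,j) lives at offset i*N + j in one 1D buffer.
__global__ void MatAddFlat(const int *A, const int *B, int *C)
{
    int j = threadIdx.x;
    int i = threadIdx.y;
    C[i * N + j] = A[i * N + j] + B[i * N + j];
}

int main(void)
{
    int A[N * N] = {1, 2, 3, 4};
    int B[N * N] = {5, 6, 7, 8};
    int C[N * N] = {0};

    int *dA, *dB, *dC;
    checkCuda(cudaMalloc((void**)&dA, N * N * sizeof(int)));
    checkCuda(cudaMalloc((void**)&dB, N * N * sizeof(int)));
    checkCuda(cudaMalloc((void**)&dC, N * N * sizeof(int)));

    checkCuda(cudaMemcpy(dA, A, N * N * sizeof(int), cudaMemcpyHostToDevice));
    checkCuda(cudaMemcpy(dB, B, N * N * sizeof(int), cudaMemcpyHostToDevice));

    dim3 threadsPerBlock(N, N);
    MatAddFlat<<<1, threadsPerBlock>>>(dA, dB, dC);  // device pointers!
    checkCuda(cudaGetLastError());     // catches bad launch configurations

    checkCuda(cudaMemcpy(C, dC, N * N * sizeof(int), cudaMemcpyDeviceToHost));

    printf("C = \n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%d ", C[i * N + j]);
        printf("\n");
    }

    checkCuda(cudaFree(dA));
    checkCuda(cudaFree(dB));
    checkCuda(cudaFree(dC));
    return 0;
}
```

Flattening sidesteps the 2D-allocation problem entirely: a single `cudaMalloc` gives one contiguous buffer, and `i * N + j` recovers the 2D indexing inside the kernel.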

1 Answer


`MatAdd<<<numBlocks,threadsPerBlock>>>(pA,pB,pC);` instead of `MatAdd<<<numBlocks,threadsPerBlock>>>(A,B,C);` solves the problem.

The reason is that A, B and C are allocated on the CPU, while pA, pB and pC are allocated on the GPU, using `cudaMalloc()`. Once pA, pB and pC are allocated, the values are sent from the CPU to the GPU by `cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);`.

Then, the addition is performed on the GPU, that is, with pA, pB and pC. To use printf, the result pC is sent from the GPU back to the CPU via `cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);`.

Think of it as if the CPU cannot see `pA` and the GPU cannot see `A`.

– francis
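For reference, here is the poster's program with only the launch arguments changed, i.e. the question's code with this fix applied, plus comments marking which address space each name lives in:

```c
#include <stdio.h>

#define N 2

__global__ void MatAdd(int A[][N], int B[][N], int C[][N]){
    int i = threadIdx.x;
    int j = threadIdx.y;

    C[i][j] = A[i][j] + B[i][j];
}

int main(){
    // Host (CPU) memory: the kernel cannot dereference these.
    int A[N][N] = {{1,2},{3,4}};
    int B[N][N] = {{5,6},{7,8}};
    int C[N][N] = {{0,0},{0,0}};

    // Device (GPU) memory: the host cannot dereference these.
    int (*pA)[N], (*pB)[N], (*pC)[N];
    cudaMalloc((void**)&pA, (N*N)*sizeof(int));
    cudaMalloc((void**)&pB, (N*N)*sizeof(int));
    cudaMalloc((void**)&pC, (N*N)*sizeof(int));

    // Host -> device: copy the input values into the GPU allocations.
    cudaMemcpy(pA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(pB, B, (N*N)*sizeof(int), cudaMemcpyHostToDevice);

    // Launch with the device pointers, not the host arrays.
    dim3 threadsPerBlock(N, N);
    MatAdd<<<1, threadsPerBlock>>>(pA, pB, pC);

    // Device -> host: copy the result back so the host printf can read it.
    cudaMemcpy(C, pC, (N*N)*sizeof(int), cudaMemcpyDeviceToHost);

    printf("C = \n");
    for (int i = 0; i < N; i++) {
        for (int j = 0; j < N; j++)
            printf("%d ", C[i][j]);
        printf("\n");
    }

    cudaFree(pA);
    cudaFree(pB);
    cudaFree(pC);
    return 0;
}
```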