CUDA Array-Vector multiply

Question

Hi, I am taking my first steps in CUDA, but I think I am not getting it right.

I am trying to multiply a two-dimensional array by a vector, but something is not working.

Here is the code I am trying to figure out:

#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void Multiply(int A[N][N], int B[N], int C[N]){
           int i = threadIdx.x;
           int j = threadIdx.y;

           int sum = A[i][j] * B[j];
           C[i]= sum;
           printf("%d,%d ", sum, C[i]);
}

int main(){

int A[N][N] ={  {1,1},
                {1,1}
            };
int B[N] = {4,6};
int C[N] = {0,0};    
int (*aA)[N], (*aB), (*aC);

cudaMalloc((void**)&aA, (N*N)*sizeof(int));
cudaMalloc((void**)&aB, (N)*sizeof(int));
cudaMalloc((void**)&aC, (N)*sizeof(int));

cudaMemcpy(aA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(aB, B, (N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(aC, C, (N)*sizeof(int), cudaMemcpyHostToDevice);

int numBlocks = 1;
dim3 threadsPerBlock(N,N);
Multiply<<<numBlocks,threadsPerBlock>>>(aA,aB,aC);

cudaMemcpy(C, aC, (N)*sizeof(int), cudaMemcpyDeviceToHost);


cudaFree(aA); 
cudaFree(aB); 
cudaFree(aC);

printf("\n");
system("pause");

}

In this case the output is: 4,6 4,6 6,6 6,6. So sum has the right values, but C[i] always returns 6, even though the value of sum is assigned to it.

What am I doing wrong?

score 1 · Accepted Answer · edited May 23 '17 at 11:51

1. Any time you are having trouble with a CUDA code, it's a good idea to use proper cuda error checking and run your code with cuda-memcheck. That's just a boiler-plate statement I make. It wouldn't actually turn up issues with the code you have shown in this case.
2. As was pointed out already in an answer now deleted, you are not actually summing anything together. Even though you have a variable named sum, it is not actually the sum of anything, and you have no + or summation operations in your kernel code. You are not writing a kernel that will sum anything together.
3. To produce a correct result, your kernel depends on cooperatively having multiple threads update a single location (C[i]). However, this requires some coordination between threads. Without any coordination, you will have threads in a race condition with each other, and the results will be unpredictable. We could sort this out using a parallel reduction, to sum together partial-products from each of the individual threads, or for simplicity we could use an atomicAdd operation, which will force threads to update (add to) C[i] one-by-one, so they don't step on each other. Using atomicAdd therefore also supplies the necessary addition (+) operation, which is lacking in your kernel.
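
For reference, the kind of error checking recommended above could look roughly like the following. This is only a sketch: the macro name CUDA_CHECK, the Fill kernel and the fill_on_device helper are made up for illustration and are not part of the original code. The runtime calls themselves (cudaGetLastError, cudaDeviceSynchronize, cudaGetErrorString) are the standard ones.

```cuda
#include <stdio.h>
#include <stdlib.h>

// Hypothetical helper macro (the name is an assumption): wraps a CUDA
// runtime call and aborts with a readable message if it fails.
#define CUDA_CHECK(call)                                            \
    do {                                                            \
        cudaError_t err_ = (call);                                  \
        if (err_ != cudaSuccess) {                                  \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err_), __FILE__, __LINE__);  \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Trivial illustrative kernel: each thread writes one element.
__global__ void Fill(int *out, int value){
    out[threadIdx.x] = value;
}

// Hypothetical host helper: fills host[0..n-1] with value via the GPU.
// Uses a single block, so n must not exceed the block size limit.
void fill_on_device(int *host, int n, int value){
    int *dev = NULL;
    CUDA_CHECK(cudaMalloc((void**)&dev, n * sizeof(int)));

    Fill<<<1, n>>>(dev, value);
    CUDA_CHECK(cudaGetLastError());       // reports launch/configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(host, dev, n * sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
}

int main(){
    int host[4] = {0, 0, 0, 0};
    fill_on_device(host, 4, 7);
    for (int i = 0; i < 4; i++){
        printf("host[%d] = %d\n", i, host[i]);
    }
    return 0;
}
```

Checking cudaGetLastError right after the launch and then the result of cudaDeviceSynchronize separates launch failures from failures during kernel execution, which is usually what you want while debugging.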

Here's a worked example with items 2 and 3 addressed, using atomicAdd. You can run it with cuda-memcheck to verify behavioral correctness even though it has no explicit error checking:

$ cat t1037.cu
#include <stdio.h>
#include <stdlib.h>

#define N 2

__global__ void Multiply(int A[N][N], int B[N], int C[N]){
           int i = threadIdx.x;
           int j = threadIdx.y;

           int product = A[i][j] * B[j];
           atomicAdd(C+i, product);
  //         printf("%d,%d ", product, C[i]);
}

int main(){

int A[N][N] ={  {1,1},
                {1,1}
            };
int B[N] = {4,6};
int C[N] = {0,0};
int (*aA)[N], (*aB), (*aC), i;

cudaMalloc((void**)&aA, (N*N)*sizeof(int));
cudaMalloc((void**)&aB, (N)*sizeof(int));
cudaMalloc((void**)&aC, (N)*sizeof(int));

cudaMemcpy(aA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(aB, B, (N)*sizeof(int), cudaMemcpyHostToDevice);
cudaMemcpy(aC, C, (N)*sizeof(int), cudaMemcpyHostToDevice);

int numBlocks = 1;
dim3 threadsPerBlock(N,N);
Multiply<<<numBlocks,threadsPerBlock>>>(aA,aB,aC);

cudaMemcpy(C, aC, (N)*sizeof(int), cudaMemcpyDeviceToHost);

for (i=0; i<N; i++){
  printf("C[%d] = %d\n",i,C[i]);
  }
cudaFree(aA);
cudaFree(aB);
cudaFree(aC);

printf("\n");

}
$ nvcc -o t1037 t1037.cu
$ cuda-memcheck ./t1037
========= CUDA-MEMCHECK
C[0] = 10
C[1] = 10

========= ERROR SUMMARY: 0 errors
$
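
The parallel reduction mentioned above as an alternative to atomicAdd might be sketched as follows. This is an illustration under assumptions, not the code the answer was tested with: the kernel name MultiplyReduce and the matvec helper are made up here, the launch uses one block per row instead of a single 2D block, N is assumed to be a power of two, and error checking is left out for brevity.

```cuda
#include <stdio.h>

#define N 2   // assumed to be a power of two for this simple reduction

// One block per row i; thread j handles column j of that row.
__global__ void MultiplyReduce(int A[N][N], int B[N], int C[N]){
    __shared__ int partial[N];
    int i = blockIdx.x;
    int j = threadIdx.x;

    // Each thread computes one partial product.
    partial[j] = A[i][j] * B[j];
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step.
    for (int stride = N / 2; stride > 0; stride /= 2){
        if (j < stride){
            partial[j] += partial[j + stride];
        }
        __syncthreads();
    }

    // Exactly one thread writes the row's result, so there is no race on C[i].
    if (j == 0){
        C[i] = partial[0];
    }
}

// Hypothetical host helper: computes C = A * B on the device.
void matvec(int A[N][N], int B[N], int C[N]){
    int (*aA)[N], *aB, *aC;

    cudaMalloc((void**)&aA, (N*N)*sizeof(int));
    cudaMalloc((void**)&aB, (N)*sizeof(int));
    cudaMalloc((void**)&aC, (N)*sizeof(int));

    cudaMemcpy(aA, A, (N*N)*sizeof(int), cudaMemcpyHostToDevice);
    cudaMemcpy(aB, B, (N)*sizeof(int), cudaMemcpyHostToDevice);

    MultiplyReduce<<<N, N>>>(aA, aB, aC);

    // This blocking copy waits for the kernel to finish first.
    cudaMemcpy(C, aC, (N)*sizeof(int), cudaMemcpyDeviceToHost);

    cudaFree(aA);
    cudaFree(aB);
    cudaFree(aC);
}

int main(){
    int A[N][N] = { {1,1},
                    {1,1} };
    int B[N] = {4,6};
    int C[N];

    matvec(A, B, C);
    for (int i = 0; i < N; i++){
        printf("C[%d] = %d\n", i, C[i]);
    }
    return 0;
}
```

With the inputs from the question this should print C[0] = 10 and C[1] = 10, the same as the atomicAdd version. For N this small the reduction buys nothing over atomicAdd; it mainly shows how the per-row sum can be formed without several threads writing to the same C[i].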
