2

The code below calculates the dot product of two vectors a and b. The correct result is 8192. When I run it for the first time the result is correct. Then when I run it for the second time the result is the previous result + 8192 and so on:

1st iteration: result = 8192
2nd iteration: result = 8192 + 8192
3rd iteration: result = 8192 + 8192 
and so on.

I checked by printing it on screen and the device variable dev_c is not freed. What's more writing to it causes something like a sum, the result beeing the previous value plus the new one being written to it. I guess that could be something with the atomicAdd() operation, but nonetheless cudaFree(dev_c) should erase it after all.

#define N 8192
#define THREADS_PER_BLOCK 512
#define NUMBER_OF_BLOCKS (N/THREADS_PER_BLOCK)
#include <stdio.h>


__global__ void dot( int *a, int *b, int *c ) {

    __shared__ int temp[THREADS_PER_BLOCK];

    int index = threadIdx.x + blockIdx.x * blockDim.x;

    temp[threadIdx.x] = a[index] * b[index];

    __syncthreads();

    if( 0 == threadIdx.x ) {

        int sum = 0;
        for( int i= 0; i< THREADS_PER_BLOCK; i++ ){
        sum += temp[i];
        }
        atomicAdd(c,sum);
    }
}

    int main( void ) {

        int *a, *b, *c;
        int *dev_a, *dev_b, *dev_c; 
        int size = N * sizeof( int); 

        cudaMalloc( (void**)&dev_a, size );
        cudaMalloc( (void**)&dev_b, size );
        cudaMalloc( (void**)&dev_c, sizeof(int));

        a = (int*)malloc(size);
        b = (int*)malloc(size);
        c = (int*)malloc(sizeof(int));

        for(int i = 0 ; i < N ; i++){
            a[i] = 1;
            b[i] = 1;
        }

        cudaMemcpy( dev_a, a, size, cudaMemcpyHostToDevice);
        cudaMemcpy( dev_b, b, size, cudaMemcpyHostToDevice);

        dot<<< N/THREADS_PER_BLOCK,THREADS_PER_BLOCK>>>( dev_a, dev_b, dev_c);

        cudaMemcpy( c, dev_c, sizeof(int) , cudaMemcpyDeviceToHost);

        printf("Dot product = %d\n", *c);

        cudaFree(dev_a);
        cudaFree(dev_b);
        cudaFree(dev_c);    

        free(a); 
        free(b); 
        free(c);

        return 0;

    }
ZviBar
  • 9,758
  • 6
  • 25
  • 30

2 Answers2

11

cudaFree doesn't erase anything, it simply returns memory to a pool to be re-allocated. cudaMalloc doesn't guarantee the value of memory that has been allocated. You need to initialize memory (both global and shared) that your program uses, in order to have consistent results. The same is true for malloc and free, by the way.

Community
  • 1
  • 1
Robert Crovella
  • 143,785
  • 11
  • 213
  • 257
  • I initialized the shared variable within the kernel and it works like charm. Thanks! – ZviBar Oct 27 '12 at 13:45
  • @ZviBar I am having the same issue. Can you please tell me how you initialized within the kernel ? Doesn't every thread reset that value if we initialize inside the kernel ? Thank you in advance. – Rajind Ruparathna Dec 14 '15 at 19:44
  • @RajindRuparathna that was over three years ago and I haven't touched Cuda since then. Sorry. – ZviBar Dec 15 '15 at 20:43
  • Doing a cudaMemcpy using a initialized array worked for me. cudaMemcpy(dev_c, c, size,cudaMemcpyHostToDevice); where c in an array initialized to zero using a for loop. – Rajind Ruparathna Dec 21 '15 at 12:43
6

From the documentation of cudaMalloc();

The memory is not cleared.

That means that dev_c is not initialized, and your atomicAdd(c,sum); will add to any random value that happens to be stored in memory at the returned position.

Joachim Isaksson
  • 176,943
  • 25
  • 281
  • 294