
I am trying to implement the dot product in CUDA and compare the result with what MATLAB returns. My CUDA code (based on this tutorial) is the following:

#include <stdio.h>

#define N (2048 * 8)
#define THREADS_PER_BLOCK 512
#define num_t float

// The kernel - DOT PRODUCT
__global__ void dot(num_t *a, num_t *b, num_t *c) 
{
  __shared__ num_t temp[THREADS_PER_BLOCK];
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  temp[threadIdx.x] = a[index] * b[index];
  __syncthreads(); //Synchronize!
  *c = 0.00;
  // Does it need to be tid==0 that
  // undertakes this task?
  if (0 == threadIdx.x) {
    num_t sum = 0.00;
    int i;
    for (i=0; i<THREADS_PER_BLOCK; i++)
      sum += temp[i];
    atomicAdd(c, sum);        
    //WRONG: *c += sum; This read-write operation must be atomic!
  }
}


// Initialize the vectors:
void init_vector(num_t *x)
{
  int i;
  for (i=0 ; i<N ; i++){
    x[i] = 0.001 * i;
  }
}

// MAIN
int main(void)
{
  num_t *a, *b, *c;
  num_t *dev_a, *dev_b, *dev_c;
  size_t size = N * sizeof(num_t);

  cudaMalloc((void**)&dev_a, size);
  cudaMalloc((void**)&dev_b, size);
  cudaMalloc((void**)&dev_c, size);

  a = (num_t*)malloc(size);
  b = (num_t*)malloc(size);
  c = (num_t*)malloc(size);

  init_vector(a);
  init_vector(b);

  cudaMemcpy(dev_a, a, size, cudaMemcpyHostToDevice);
  cudaMemcpy(dev_b, b, size, cudaMemcpyHostToDevice);

  dot<<<N/THREADS_PER_BLOCK, THREADS_PER_BLOCK>>>(dev_a, dev_b, dev_c);

  cudaMemcpy(c, dev_c, sizeof(num_t), cudaMemcpyDeviceToHost);

  printf("a = [\n");
  int i;
  for (i=0;i<10;i++){
    printf("%g\n",a[i]);
  }
  printf("...\n");
  for (i=N-10;i<N;i++){
    printf("%g\n",a[i]);
  }
  printf("]\n\n");
  printf("a*b = %g.\n", *c);


  free(a); free(b); free(c);

  cudaFree(dev_a);
  cudaFree(dev_b);
  cudaFree(dev_c);

}

and I compile it with:

/usr/local/cuda-5.0/bin/nvcc -m64  -I/usr/local/cuda-5.0/include -gencode arch=compute_20,code=sm_20 -o multi_dot_product.o -c multi_dot_product.cu
g++ -m64 -o multi_dot_product multi_dot_product.o -L/usr/local/cuda-5.0/lib64 -lcudart

Information about my NVIDIA cards can be found at http://pastebin.com/8yTzXUuK. I tried to verify the result in MATLAB using the following simple code:

N = 2048 * 8;
a = zeros(N,1);
for i=1:N
    a(i) = 0.001*(i-1);
end

dot_product = a'*a;

But as N increases, I'm getting significantly different results (for instance, for N=2048*32 CUDA returns 6.73066e+07 while MATLAB returns 9.3823e+07; for N=2048*64 CUDA gives 3.28033e+08 while MATLAB gives 7.5059e+08). I'm inclined to believe that the discrepancy stems from the use of float in my C code, but if I replace it with double the compiler complains that atomicAdd does not support double parameters. How should I fix this problem?

Update: Also, for large values of N (e.g. 2048*64), I noticed that the result returned by CUDA changes from run to run. This does not happen if N is small (e.g. 2048*8).

At the same time I have a more fundamental question: the variable temp is an array of size THREADS_PER_BLOCK and is shared between threads in the same block. Is it also shared between blocks, or does every block operate on a different copy of this variable? Should I think of the method dot as instructions to every block? Can someone elaborate on how exactly the jobs are split and how the variables are shared in this example?

Pantelis Sopasakis
  • Every block operates on a different copy of any given `__shared__` variable, including your `temp`. You can do `double` atomicAdd using the method outlined [here](http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions); a sketch of that method appears right after these comments. – Robert Crovella Apr 04 '13 at 22:10
  • The fact that you have this line: `*c = 0.00;` unprotected by any synchronization in your kernel code is messing you up. Initialize `c` to zero before you call your kernel, and eliminate that line from your kernel. Every thread of every block, whenever they happen to execute, is zeroing out `c` as you have written it. – Robert Crovella Apr 04 '13 at 22:27
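The method linked in the first comment is the atomicCAS-based workaround from the CUDA C Programming Guide, needed because compute capability 2.0 devices have no native double atomicAdd. A sketch, renamed atomicAddDouble here so it cannot collide with the built-in float overload (or with the native double version on much newer GPUs):

__device__ double atomicAddDouble(double *address, double val)
{
  unsigned long long int *address_as_ull = (unsigned long long int *)address;
  unsigned long long int old = *address_as_ull, assumed;
  do {
    assumed = old;
    // Attempt to swap in (assumed + val), but only if *address still
    // holds the value we last read.
    old = atomicCAS(address_as_ull, assumed,
                    __double_as_longlong(val + __longlong_as_double(assumed)));
  } while (assumed != old); // another thread intervened; retry
  return __longlong_as_double(old);
}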

1 Answer


Comment this line out of your kernel:

//   *c = 0.00;

And add these lines to your host code, before the kernel call (after the cudaMalloc of dev_c):

num_t h_c = 0.0f;
cudaMemcpy(dev_c, &h_c, sizeof(num_t), cudaMemcpyHostToDevice);
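
An equivalent alternative (my addition, not from the answer): since an all-zero bit pattern is 0.0f for IEEE-754 floats, a single cudaMemset does the same job:

cudaMemset(dev_c, 0, sizeof(num_t)); // byte-wise zeroing yields 0.0f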

And I believe you'll get results that match MATLAB, more or less. (They won't match exactly: the kernel accumulates in float while MATLAB computes in double, so the low-order digits will differ.)

The fact that you have this line in your kernel, unprotected by any synchronization, is what was corrupting your result. Every thread of every block, whenever it happens to execute, zeroes out c as you have written it.

By the way, we can do significantly better with this operation in general by using a classical parallel reduction method. A basic (not optimized) illustration is here. If you combine that method with your usage of shared memory and a single atomicAdd at the end (one atomicAdd per block), you'll have a substantially improved implementation; a sketch follows. Although it's not a dot product, this example combines those ideas.
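Here is a minimal sketch of that combined approach (my illustration, not the linked code; it assumes THREADS_PER_BLOCK is a power of two and N is a multiple of it, as in your setup):

__global__ void dot_reduce(num_t *a, num_t *b, num_t *c)
{
  __shared__ num_t temp[THREADS_PER_BLOCK];
  int index = threadIdx.x + blockIdx.x * blockDim.x;
  temp[threadIdx.x] = a[index] * b[index];
  __syncthreads();
  // Classical tree reduction: halve the number of active threads at
  // each step until temp[0] holds this block's partial sum.
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (threadIdx.x < s)
      temp[threadIdx.x] += temp[threadIdx.x + s];
    __syncthreads();
  }
  // One atomicAdd per block, replacing the serial 512-element loop.
  if (threadIdx.x == 0)
    atomicAdd(c, temp[0]);
}

The reduction takes O(log THREADS_PER_BLOCK) parallel steps instead of a 512-iteration serial loop in a single thread, and contention on c stays at one atomic per block.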

Edit: responding to a question below in the comments:

A kernel function is the set of instructions that all threads in the grid (by definition, all threads associated with a kernel launch) execute. It's reasonable to think of execution as being managed per threadblock, since the threads in a threadblock execute together to a large extent. Even within a threadblock, though, execution is not necessarily in perfect lockstep across all threads. Normally, when we think of lockstep execution, we think of a warp, which is a group of 32 threads within a single threadblock. Since execution among the warps within a block can be skewed, the hazard was present even for a single threadblock. With only one threadblock we could have eliminated it using appropriate sync and control mechanisms such as __syncthreads() and if (threadIdx.x == 0), but those mechanisms are useless for the general case of controlling execution across multiple threadblocks. Multiple threadblocks can execute in any order, and the only defined sync mechanism across an entire grid is the kernel launch itself. Therefore, to fix your issue, we had to zero out c prior to the kernel launch.
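To make the single-threadblock remark concrete, here is a sketch (my own illustration, assuming a <<<1, THREADS_PER_BLOCK>>> launch and N == THREADS_PER_BLOCK) of how a thread-0 guard plus __syncthreads() would make the in-kernel zeroing safe when only one block exists:

__global__ void dot_one_block(num_t *a, num_t *b, num_t *c)
{
  __shared__ num_t temp[THREADS_PER_BLOCK];
  temp[threadIdx.x] = a[threadIdx.x] * b[threadIdx.x];
  if (threadIdx.x == 0)
    *c = 0.0f;      // exactly one thread zeroes c, before anyone updates it
  __syncthreads();  // barrier: all products stored and c zeroed before proceeding
  if (threadIdx.x == 0) {
    num_t sum = 0.0f;
    for (int i = 0; i < THREADS_PER_BLOCK; i++)
      sum += temp[i];
    atomicAdd(c, sum); // runs exactly once here; kept for symmetry with the original
  }
}

With multiple blocks this pattern breaks down again: __syncthreads() only orders threads within one block, so another block could zero c after this block has already contributed its partial sum.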

Robert Crovella
  • Damn, right! So, initialization of variables should be done in a protected/synchronized manner (as I did for dev_a and dev_b). One more question to make sure I'm getting this right: A kernel function should be perceived as a bunch of instructions to be executed in a block, right? So, this is why we use `temp[threadIdx.x]`. – Pantelis Sopasakis Apr 05 '13 at 12:59
  • responded to this question in the answer, above – Robert Crovella Apr 05 '13 at 13:34