6

I'm trying to implement the classic dot-product kernel for double precision arrays, with atomic accumulation of the final sum across the various blocks. I used the atomicAdd for double precision described on page 116 of the programming guide. Probably I'm doing something wrong. The partial sums across the threads in every block are computed correctly, but afterwards the atomic operation doesn't seem to be working properly, since every time I run my kernel with the same data I receive different results. I'll be grateful if somebody could spot the mistake or provide an alternative solution! Here is my kernel:

__global__ void cuda_dot_kernel(int *n,double *a, double *b, double *dot_res)
{
    __shared__ double cache[threadsPerBlock]; //thread shared memory
    int global_tid=threadIdx.x + blockIdx.x * blockDim.x;
    int i=0,cacheIndex=0;
    double temp = 0;
    cacheIndex = threadIdx.x;
    while (global_tid < (*n)) {
        temp += a[global_tid] * b[global_tid];
        global_tid += blockDim.x * gridDim.x;
    }
    cache[cacheIndex] = temp;
    __syncthreads();
    for (i=blockDim.x/2; i>0; i>>=1) {
        if (threadIdx.x < i) {
            cache[threadIdx.x] += cache[threadIdx.x + i];
        }
        __syncthreads();
    }
    __syncthreads();
    if (cacheIndex==0) {
        *dot_res=cuda_atomicAdd(dot_res,cache[0]);
    }
}

And here is my device function atomicAdd:

__device__ double cuda_atomicAdd(double *address, double val)
{
    double assumed,old=*address;
    do {
        assumed=old;
        old= __longlong_as_double(atomicCAS((unsigned long long int*)address,
                    __double_as_longlong(assumed),
                    __double_as_longlong(val+assumed)));
    }while (assumed!=old);

    return old;
}
tim_chemeng
  • Shared memory atomics are pretty slow. This is not a good way to implement a dot product. You are better off using Thrust, as Jared points out. If you insist on writing your own code, and you really want to do it in a single kernel, see the threadFenceReduction sample in the CUDA SDK code samples. It should be much more efficient (it's not a dot product, just a sum reduction, but adding the initial element-wise multiply should be trivial.) – harrism Feb 26 '12 at 11:25
  • @harrism: Where are there shared memory atomics in this code? This is just a standard shared memory reduction with global memory atomic operations to complete summation of the blockwise partial reduced values. – talonmies Feb 26 '12 at 11:52
  • Sorry, I transposed the atomic arguments in my head! Regardless, you shouldn't need atomics to implement reduction in a single kernel if you use threadfence. – harrism Feb 26 '12 at 22:17
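
For completeness, here is an untested sketch of the threadfence-based single-kernel approach harrism describes (modelled loosely on the idea behind the threadFenceReduction SDK sample, not copied from it). The kernel name, the partial array and the threadsPerBlock constant are illustrative; blockDim.x is assumed to be a power of two, and partial must hold gridDim.x doubles:

#define threadsPerBlock 256   // assumed; must match the launch configuration

__device__ unsigned int retirementCount = 0;

__global__ void cuda_dot_threadfence(int n, const double *a, const double *b,
                                     double *partial, double *result)
{
    __shared__ double cache[threadsPerBlock];
    __shared__ bool lastBlock;

    // grid-stride accumulation, as in the kernel above
    double temp = 0.0;
    for (int i = blockIdx.x * blockDim.x + threadIdx.x; i < n;
         i += blockDim.x * gridDim.x)
        temp += a[i] * b[i];
    cache[threadIdx.x] = temp;
    __syncthreads();

    // shared memory tree reduction (blockDim.x must be a power of two)
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (threadIdx.x < s)
            cache[threadIdx.x] += cache[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0) {
        partial[blockIdx.x] = cache[0];
        __threadfence();                          // make the partial visible to all blocks
        unsigned int ticket = atomicInc(&retirementCount, gridDim.x);
        lastBlock = (ticket == gridDim.x - 1);    // true only in the last block to finish
    }
    __syncthreads();

    // the last block to retire sums the per-block partials
    if (lastBlock) {
        temp = 0.0;
        for (int i = threadIdx.x; i < gridDim.x; i += blockDim.x)
            temp += partial[i];
        cache[threadIdx.x] = temp;
        __syncthreads();
        for (int s = blockDim.x / 2; s > 0; s >>= 1) {
            if (threadIdx.x < s)
                cache[threadIdx.x] += cache[threadIdx.x + s];
            __syncthreads();
        }
        if (threadIdx.x == 0) {
            *result = cache[0];
            retirementCount = 0;                  // reset for the next launch
        }
    }
}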

3 Answers

9

Getting a reduction right using ad hoc CUDA code can be tricky, so here's an alternative solution using a Thrust algorithm, which is included with the CUDA Toolkit:

#include <thrust/inner_product.h>
#include <thrust/device_ptr.h>

double do_dot_product(int n, double *a, double *b)
{
  // wrap raw pointers to device memory with device_ptr
  thrust::device_ptr<double> d_a(a), d_b(b);

  // inner_product implements a mathematical dot product
  return thrust::inner_product(d_a, d_a + n, d_b, 0.0);
}
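
A minimal host-side usage sketch (the array size and fill values are made up for illustration; d_a and d_b are assumed to be ordinary device allocations):

#include <cuda_runtime.h>
#include <cstdio>

int main()
{
  const int n = 1 << 20;

  // fill host arrays with some test values
  double *h_a = new double[n], *h_b = new double[n];
  for (int i = 0; i < n; ++i) { h_a[i] = 1.0; h_b[i] = 2.0; }

  // allocate device memory and copy the inputs over
  double *d_a, *d_b;
  cudaMalloc(&d_a, n * sizeof(double));
  cudaMalloc(&d_b, n * sizeof(double));
  cudaMemcpy(d_a, h_a, n * sizeof(double), cudaMemcpyHostToDevice);
  cudaMemcpy(d_b, h_b, n * sizeof(double), cudaMemcpyHostToDevice);

  printf("dot = %f\n", do_dot_product(n, d_a, d_b)); // expect 2 * n

  cudaFree(d_a); cudaFree(d_b);
  delete[] h_a; delete[] h_b;
  return 0;
}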
Jared Hoberock
4

You are using the cuda_atomicAdd function incorrectly. This section of your kernel:

if (cacheIndex==0) {
    *dot_res=cuda_atomicAdd(dot_res,cache[0]);
}

is the culprit. Here, you atomically add to dot_res, then non-atomically overwrite dot_res with the result the function returns. The return value of this function is the previous value of the location being atomically updated, and it is supplied for "information" or local use by the caller only. You don't assign it back to the location you are atomically updating; that completely defeats the purpose of using atomic memory access in the first place. Do something like this instead:

if (cacheIndex==0) {
    double result=cuda_atomicAdd(dot_res,cache[0]);
}
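
One host-side detail that easily produces non-reproducible totals: dot_res has to be zeroed before every launch, or each run accumulates on top of the previous result. A minimal driver sketch (the grid size of 64 blocks is arbitrary, and threadsPerBlock is the same constant used for the shared array):

double host_dot(int n, double *d_a, double *d_b)
{
    int *d_n;
    double *d_res, result;
    cudaMalloc(&d_n, sizeof(int));
    cudaMalloc(&d_res, sizeof(double));
    cudaMemcpy(d_n, &n, sizeof(int), cudaMemcpyHostToDevice);

    // zero the accumulator before every launch
    cudaMemset(d_res, 0, sizeof(double));

    cuda_dot_kernel<<<64, threadsPerBlock>>>(d_n, d_a, d_b, d_res);

    cudaMemcpy(&result, d_res, sizeof(double), cudaMemcpyDeviceToHost);
    cudaFree(d_n); cudaFree(d_res);
    return result;
}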
talonmies
  • Thank you for your reply. Since the global variable *dot_res is initialized to 0, I will then have gridDim.x blocks, each with a local variable "result" containing the same value as the shared variable cache[0], right (result = cache[0] + *dot_res = cache[0])? If I understood correctly, there would be no final reduction this way. Is there a way to finish up the reduction on the device? I tried using the mutex example from CUDA by Example but it seems to produce a deadlock. – tim_chemeng Feb 26 '12 at 09:59
  • I am not sure I understand what you are asking. If you just make the change I showed, I believe it should work as you imagine and the reduction should be completed. The atomicCAS loop should just hammer away until each calling thread's contribution has been registered in the global total. Because you are probably only running something between 10 & 100 blocks, there shouldn't be too much contention for `dot_res` and it should work OK. – talonmies Feb 26 '12 at 10:12
  • I'm asking about the variable result. This variable has local scope, right? Only threads with cacheIndex=0 can view their exclusive copy of this variable and modify it? So how am I going to produce, globally across all blocks, only one result variable containing the partial sums of all blocks? – tim_chemeng Feb 26 '12 at 10:19
  • 1
    `result` is irrelevant. You don't need it to make the code work. You could rewrite the atomic add not to return it if you wanted to. Its value is the __previous__ value of `dot_res`, not the new value. The atomic add function is updating `dot_res` itself internally; that is where the dot product is stored. – talonmies Feb 26 '12 at 10:28
  • Is there any chance that rounding error in the atomic operation significantly affects the final result? Because the results I get when I perform the final sum with atomicAdd are slightly different from the ones I got performing the serial CPU final sum. Furthermore, there isn't any reproducibility in the result obtained via the atomic operation. – tim_chemeng Feb 26 '12 at 14:11
  • I very, very much doubt it. It is to be expected that there will be a small difference between results computed serially on the CPU and in parallel on the GPU - floating point arithmetic is not [associative](http://en.wikipedia.org/wiki/Floating_point#Accuracy_problems). You probably have a problem with the host code you are using -- I built a full repro case with your kernel and it works exactly as expected. – talonmies Feb 26 '12 at 14:32
-1

I did not check your code in that much depth, but here is some advice.
I would only advise using Thrust if you use your GPU solely for such generic tasks, since when a more complex problem arises, people who rely on it have no idea how to program the GPU efficiently in parallel.

  1. Start a new parallel reduction kernel to sum up the per-block dot products (see the sketch after the kernel below).
    Since the data is already on the device, you won't see a decrease in performance from starting a new kernel.

  2. Your kernel does not seem to scale across the maximum possible number of blocks on the newest GPUs. If it did, and your kernel were able to calculate the dot product of millions of values, the performance would decrease dramatically because of the serialized atomic operation.

  3. Beginner mistake: Is your input data and shared memory access range checked? Or are you sure the input data is always a multiple of your block size? Otherwise you will read garbage. Most of my wrong results were due to this fault.

  4. Optimise your parallel reduction. See my thesis or Mark Harris's "Optimizing Parallel Reduction in CUDA" slides.

Untested; I just wrote it down in Notepad:

/*
 * @param inCount_s   unsigned long long int  Length of both input arrays
 * @param inValuesA_g double*                 First value array
 * @param inValuesB_g double*                 Second value array
 * @param outDots_g   double*                 Output dot products of each block, length equals the number of blocks
 */
__global__ void dotProduct(const unsigned long long int inCount_s,
    const double* inValuesA_g,
    const double* inValuesB_g,
    double* outDots_g)
{
    //get unique block index in a possible 3D Grid
    const unsigned long long int blockId = blockIdx.x //1D
            + blockIdx.y * gridDim.x //2D
            + gridDim.x * gridDim.y * blockIdx.z; //3D


    //block dimension uses only x-coordinate
    const unsigned long long int tId = blockId * blockDim.x + threadIdx.x;

    /*
     * shared value pair products array, where BLOCK_SIZE power of 2
     *
     * To improve performance, increase its size by a multiple of BLOCK_SIZE, so that each thread loads more than 1 element!
     * (outDots_g length decreases by the same factor, and you need to range check and initialize memory)
     * -> see the Harris GPU optimisation / parallel reduction slides for more information.
     */
    __shared__ double dots_s[BLOCK_SIZE];


    /*
     * initialize the shared memory array and calculate the dot product of two values;
     * shared memory always needs to be initialized, it's never 0 by default, otherwise garbage is read later!
     */
    if(tId < inCount_s)
        dots_s[threadIdx.x] = inValuesA_g[tId] * inValuesB_g[tId];
    else
        dots_s[threadIdx.x] = 0;
    __syncthreads();

    //do parallel reduction on the shared memory array to sum up the values
    //(plain tree reduction; see my thesis link / the Harris slides for the optimised variants)
    for(unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if(threadIdx.x < s)
            dots_s[threadIdx.x] += dots_s[threadIdx.x + s];
        __syncthreads();
    }

    //output this block's partial dot product
    if(threadIdx.x == 0)
        outDots_g[blockId] = dots_s[0];

    //start a new parallel reduction kernel to sum up outDots_g (see the sketch below)!
}
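
The second-pass kernel mentioned in that last comment could look something like this (again untested and the names are made up; BLOCK_SIZE is assumed to be a power of two, and it is meant to be launched as a single block of BLOCK_SIZE threads over the outDots_g array):

__global__ void sumPartials(const unsigned long long int inCount_s,
    const double* inPartials_g,
    double* outResult_g)
{
    __shared__ double sums_s[BLOCK_SIZE];

    //each thread accumulates a strided portion of the per-block partials
    double sum = 0.0;
    for(unsigned long long int i = threadIdx.x; i < inCount_s; i += blockDim.x)
        sum += inPartials_g[i];
    sums_s[threadIdx.x] = sum;
    __syncthreads();

    //shared memory tree reduction, as in the dot-product kernel above
    for(unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if(threadIdx.x < s)
            sums_s[threadIdx.x] += sums_s[threadIdx.x + s];
        __syncthreads();
    }

    //final dot product
    if(threadIdx.x == 0)
        *outResult_g = sums_s[0];
}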

Edit: removed unnecessary points.

djmj
  • 1. "The kernel should be run with just enough blocks to fill every SM in the GPU." Who said it should not run with just enough blocks? I said the kernel itself should be scalable across the maximum number of blocks! 2. Regarding this simple kernel there is no need for any stride. Simplest coalesced read pattern applies here: http://developer.download.nvidia.com/compute/cuda/2_0/docs/NVIDIA_CUDA_Programming_Guide_2.0.pdf Figure 5-1 – djmj Feb 26 '12 at 15:59
  • 2. "Point #5 is also wrong." Basic c knowledge. Don't read outside your pointer lengths. You will just read whatever is on that memory address. For shared memory: http://stackoverflow.com/questions/6478098/is-there-a-way-of-setting-default-value-for-shared-memory-array – djmj Feb 26 '12 at 16:01
  • Point #3 is still not applicable. Perhaps you don't understand what the code does, but it __has__ implicit global memory range checking in the accumulation loop. – talonmies Feb 26 '12 at 16:09
  • @talonmies "Beginner mistake: Is your input data and shared memory access range checked?" That's why I asked ;) If the code is not commented for us why should someone take his time to search an error, so I just give basic advices. His original question included "... or provide an alternative solution!" Thats what I did and with a better performance. – djmj Feb 26 '12 at 16:22