
In a previous post here, I asked how to calculate the sum of an array with a reduction. Now I have a new problem: with a larger image my result is not correct, and it changes every time I run. I tested with a 96*96 image-size array sample.

First time result: 28169.046875

Second time result: 28169.048828

Expected result: 28169.031250

Here is my code:

#include <stdio.h>
#include <cuda.h>

__global__  void calculate_threshold_kernel(float * input, float * output)
{

   int idx = blockIdx.x * blockDim.x + threadIdx.x;
   int t = threadIdx.x;
   __shared__ float partialSum[256];
   partialSum[t] = input[idx];
   __syncthreads();

   for (int stride = 1; stride < blockDim.x; stride *= 2)
   {
       if (t % (2 * stride) == 0)
           partialSum[t] += partialSum[t + stride];

       __syncthreads();
   }

   if (t == 0)
   {
       atomicAdd(output,partialSum[0]);
   }

}
int main( void )
{

    float *d_array, *d_output,*h_input, *h_output;
    int img_height = 96;
    int img_width = 96;
    int input_elements = img_height * img_width;

    h_input = (float*) malloc(sizeof(float) * input_elements);
    cudaMalloc((void**)&d_output, sizeof(float));
    cudaMemset(d_output, 0, sizeof(float));
    h_output = (float*)malloc(sizeof(float));
    cudaMalloc((void**)&d_array, input_elements*sizeof(float));

    float array[] = {[array sample]};
    for (int i = 0; i < input_elements; i++)
    {
        h_input[i] = array[i];
    }

    cudaMemcpy(d_array, h_input, input_elements*sizeof(float), cudaMemcpyHostToDevice);

    dim3 blocksize(256);
    dim3 gridsize(input_elements/blocksize.x);

    calculate_threshold_kernel<<<gridsize,blocksize>>>(d_array, d_output);

    cudaMemcpy(h_output, d_output, sizeof(float), cudaMemcpyDeviceToHost);

    printf("Sum from GPU = %f\n", *h_output);

    return 0;
}

2 Answers


While the answer from Kangshiyin is correct about floating-point accuracy and floating-point arithmetic being non-associative, he is not correct about the reason behind the results differing from one run to the next.

  • Floating-point arithmetic is non-associative, which means that performing the same additions in a different order can return different results. For example, (((a+b)+c)+d) may be slightly different from ((a+b)+(c+d)) for certain values of a, b, c and d. But neither of these results should vary from run to run.

  • Your results vary between runs because atomicAdd causes the order of the additions to differ from one run to another. Using double also does not guarantee that the results will be the same across runs.

  • There are ways to implement parallel reduction without atomicAdd as the final step (e.g. use a second kernel launch to add the partial sums from the first launch), which can provide consistent results (though still slightly different from the CPU result); a sketch follows below.
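
For illustration, here is a minimal sketch of that two-pass approach. The kernel name block_sum_kernel, the intermediate buffer d_partial, and the hard-coded launch of 36 blocks of 256 threads (matching the 96*96 input in the question) are my own assumptions, not part of the original code:

// Pass 1: each block reduces its 256 elements and writes one partial sum.
// Pass 2: run the same kernel again with a single block over the 36 partial sums.
__global__ void block_sum_kernel(const float *input, float *partial, int n)
{
    __shared__ float sdata[256];
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    int t = threadIdx.x;
    sdata[t] = (idx < n) ? input[idx] : 0.0f;   // pad with zeros past the end
    __syncthreads();

    // Sequential-addressing reduction: the order of additions is fixed,
    // so the result is identical on every run.
    for (int stride = blockDim.x / 2; stride > 0; stride >>= 1)
    {
        if (t < stride)
            sdata[t] += sdata[t + stride];
        __syncthreads();
    }

    if (t == 0)
        partial[blockIdx.x] = sdata[0];
}

// Host side (error checking omitted):
//   block_sum_kernel<<<36, 256>>>(d_array, d_partial, input_elements); // 36 partial sums
//   block_sum_kernel<<<1, 256>>>(d_partial, d_output, 36);             // final sum in d_output[0]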

Pavan Yalamanchili

float has limited precision, up to about 7 decimal digits, as explained here:

https://en.wikipedia.org/wiki/Floating_point#Accuracy_problems

The result changes between runs because floating-point addition is non-associative and you are using a parallel reduction with atomicAdd(), which cannot keep the order of the additions fixed.
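
For instance, a small host-side example showing that summing the same three floats in a different order gives a different answer (the values are ones I picked to make the rounding visible):

#include <stdio.h>

int main(void)
{
    float a = 1.0e8f, b = -1.0e8f, c = 0.03125f;
    printf("%f\n", (a + b) + c);  // prints 0.031250
    printf("%f\n", a + (b + c));  // prints 0.000000: c is lost when added to the huge b
    return 0;
}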

You could use double instead if you want a more accurate result.

kangshiyin
  • Thank you, but as far as I know **atomicAdd()** does not support **double**, right? – Nguyễn Cường Aug 01 '16 at 07:55
  • There's an implementation in the CUDA documentation, which uses `atomicCAS()`. You could use it. http://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#atomic-functions – kangshiyin Aug 01 '16 at 08:21
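
Following up on that comment, here is a sketch along the lines of the atomicCAS()-based double-precision add shown in the CUDA programming guide; I've named it atomicAddDouble so it doesn't clash with the native double atomicAdd() available on compute capability 6.0 and later:

// Emulated double-precision atomic add using atomicCAS on the 64-bit bit pattern.
__device__ double atomicAddDouble(double *address, double val)
{
    unsigned long long int *address_as_ull = (unsigned long long int *)address;
    unsigned long long int old = *address_as_ull, assumed;
    do {
        assumed = old;
        old = atomicCAS(address_as_ull, assumed,
                        __double_as_longlong(val + __longlong_as_double(assumed)));
        // The integer comparison below also terminates correctly for NaN
        // (a floating-point compare would loop forever, since NaN != NaN).
    } while (assumed != old);
    return __longlong_as_double(old);
}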