CUDA summation per block: the sums always come back as 0. What is wrong?

Question

I tried to implement a summation in CUDA, but I can't find what I did wrong. The returned sum is always 0. Can anyone help?

The __shared__ qualifier declares a variable that is common to all threads in a block. So I tried to sum one block at a time, and then add up the per-block results to get the overall sum.

But the per-block sum doesn't work, and I am stuck.

    #include <stdio.h>
    #include <iostream>
    #include <cuda.h>
    #include <stdlib.h>
    //#define BLOCK_SIZE 32         // size of vectors


    __global__ void add( float * i_data, float * sum){
        int tid = blockIdx.x *  blockDim.x + threadIdx.x;

         __shared__ float s_data;

        s_data = 0;
        // must be synchronized

        __syncthreads();

        // reduce and sum

        // typical in GPU computings

        for (int i = 0; i<blockDim.x; i++)
        {
            __syncthreads();

            if (tid <= i)

            {
                //s_data[blockIdx.x]+ = s_data[tid] + s_data[i+tid];

                s_data+= i_data[tid];

            }
        }
        if (tid == 0)
            sum[blockIdx.x]=s_data;

    }

    int main()  {
        int T = 10, B = 5;                  // threads per block and blocks per grid
        float  *a,*b;               // host pointers
        float *dev_a, *dev_b;       // device pointers to host memory
        int sizeIN = T*B*sizeof(int);
        int sizeOUT = B*sizeof(int);

        a= new float[T*B];
        b= new float[B];

        for(int i = 0;i<B;i++)
        {
            for (int j=0;j<T;j++)
            {
                a[i*T+j]=i;
            }
        }
        for(int i = 0;i<B;i++)
        {
            b[i]=0;
        }




        cudaMalloc((void **) &dev_a, sizeIN);
        cudaMalloc((void **) &dev_b, sizeOUT);

        cudaMemcpy(dev_a, a, sizeIN, cudaMemcpyHostToDevice);
        cudaMemcpy(dev_b, b, sizeOUT, cudaMemcpyHostToDevice);

        add<<< B, T >>> (dev_a, dev_b);

        cudaMemcpy(a,dev_a,  sizeIN, cudaMemcpyDeviceToHost);
        cudaMemcpy(b,dev_b,  sizeOUT, cudaMemcpyDeviceToHost);

        for(int i = 0;i<B;i++)
        {
            for (int j=0;j<T;j++)
            {
                std::cout<< a[i*T+j]<<"\t";
                std::cout<<std::endl;
            }
            std::cout<<std::endl<<std::endl<<"sum is: "<<b[i]<<std::endl;
        }



        std::cout<<std::endl<<std::endl;

        cudaFree(dev_a);
        cudaFree(dev_b);
        free(a);
        free(b);
        return 0;
    }

score 1 · Accepted Answer · edited May 23 '17 at 11:58

This is wrong in 2 ways:

    if (tid = 0)

First, you should be doing a comparison == not an assignment =. I don't know why your compiler didn't warn you about this.

Second, tid is only zero for one thread in the entire grid:

    int tid = blockIdx.x *  blockDim.x + threadIdx.x;

You want one thread in each block to write the block result out to global memory:

    if (threadIdx.x == 0)

This is also a problem, similarly:

        if (tid <= i)

This is only satisfied for threads in the first block. Beyond that, I have to start to guess at what you want. I guess you're trying to sum the values in each block. Your construction is not a parallel reduction, but to make the minimum changes to get it "functional" I would rewrite the end of your kernel like this:

        // reduce and sum
        // typical in GPU computings
        for (int i = 0; i<blockDim.x; i++)
        {
            if (threadIdx.x == i)
            {
                //s_data[blockIdx.x]+ = s_data[tid] + s_data[i+tid];
                s_data+= i_data[tid];
            }
            __syncthreads();
        }
        if (threadIdx.x == 0)
            sum[blockIdx.x]=s_data;
    }
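For reference, here is a minimal sketch of how the kernel from the question and the rewritten ending above might fit together. The host helper block_sums is a hypothetical name introduced only for illustration, and the sketch assumes the input holds exactly blocks * threads elements. One small deviation from the original: only thread 0 initialises s_data, rather than every thread writing the same zero.

```cuda
#include <cuda_runtime.h>
#include <vector>

// Per-block sum using a single shared accumulator.
// Threads take turns adding their element, so the loop is serial within a block.
__global__ void add(const float *i_data, float *sum) {
    int tid = blockIdx.x * blockDim.x + threadIdx.x;

    __shared__ float s_data;
    if (threadIdx.x == 0)
        s_data = 0.0f;               // one writer initialises the accumulator
    __syncthreads();                 // make the initial value visible to the block

    for (int i = 0; i < blockDim.x; i++) {
        if (threadIdx.x == i)        // exactly one thread per block adds in each step
            s_data += i_data[tid];
        __syncthreads();             // reached by every thread in every iteration
    }

    if (threadIdx.x == 0)            // one thread per block writes the block result
        sum[blockIdx.x] = s_data;
}

// Hypothetical host helper (not part of the original post).
// Assumes in.size() == blocks * threads.
std::vector<float> block_sums(const std::vector<float> &in, int blocks, int threads) {
    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, in.size() * sizeof(float));
    cudaMalloc(&dev_out, blocks * sizeof(float));
    cudaMemcpy(dev_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    add<<<blocks, threads>>>(dev_in, dev_out);

    std::vector<float> out(blocks);
    cudaMemcpy(out.data(), dev_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return out;
}
```

With the data from the question (B = 5 blocks of T = 10 elements, each element equal to its block index), block_sums(a, 5, 10) should give one sum per block.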

Although you didn't have any CUDA API errors, it's good practice to use proper CUDA error checking, and to run your code with cuda-memcheck, any time you are having trouble with CUDA code.

I mentioned that your code above is not a classical reduction. It's just a suboptimal for-loop in which only one thread per block does any work in each iteration.
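One common way to do the error checking is to wrap every runtime call in a macro and to query the error state after each kernel launch. The sketch below is only an illustration: CUDA_CHECK, scale and scale_checked are hypothetical names, not something from the code in the question, and the helper assumes a non-empty input.

```cuda
#include <cuda_runtime.h>
#include <cstdio>
#include <cstdlib>
#include <vector>

// Hypothetical helper macro: print the error with file and line, then abort.
#define CUDA_CHECK(call) \
    do { \
        cudaError_t err_ = (call); \
        if (err_ != cudaSuccess) { \
            std::fprintf(stderr, "CUDA error: %s at %s:%d\n", \
                         cudaGetErrorString(err_), __FILE__, __LINE__); \
            std::exit(EXIT_FAILURE); \
        } \
    } while (0)

// Hypothetical kernel, present only so that there is something to launch.
__global__ void scale(float *data, float factor, int n) {
    int idx = blockIdx.x * blockDim.x + threadIdx.x;
    if (idx < n)
        data[idx] *= factor;
}

// Multiplies every element by factor on the device, checking each step.
// Assumes values is not empty (a launch with zero blocks is an error).
std::vector<float> scale_checked(std::vector<float> values, float factor) {
    int n = static_cast<int>(values.size());
    size_t bytes = n * sizeof(float);
    float *dev = nullptr;

    CUDA_CHECK(cudaMalloc(&dev, bytes));
    CUDA_CHECK(cudaMemcpy(dev, values.data(), bytes, cudaMemcpyHostToDevice));

    int threads = 128;
    int blocks = (n + threads - 1) / threads;
    scale<<<blocks, threads>>>(dev, factor, n);
    CUDA_CHECK(cudaGetLastError());        // reports launch errors (bad configuration etc.)
    CUDA_CHECK(cudaDeviceSynchronize());   // reports errors raised while the kernel ran

    CUDA_CHECK(cudaMemcpy(values.data(), dev, bytes, cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(dev));
    return values;
}
```

The memory checker is run from the command line as cuda-memcheck ./your_app. As far as I know, recent CUDA toolkits ship compute-sanitizer as its replacement, so the tool name may differ on your installation.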

To learn about CUDA parallel reduction, study the CUDA sample code and the accompanying presentation. There are also many examples under the CUDA tag here on SO that you can search for.
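To give a rough idea of what such a reduction looks like, here is a minimal, unoptimised sketch of a shared-memory tree reduction. It is not the code from the CUDA sample. The names block_reduce and reduce_per_block, and the per_block parameter, are assumptions made for illustration. The sketch rounds the block size up to a power of two and pads with zeros, and it assumes per_block is at least 1 and no larger than the maximum block size (1024 on current GPUs).

```cuda
#include <cuda_runtime.h>
#include <vector>

// Each block sums per_block consecutive input elements.
// blockDim.x must be a power of two and >= per_block.
__global__ void block_reduce(const float *in, float *out, int per_block) {
    extern __shared__ float s[];     // blockDim.x floats, sized at launch
    int t = threadIdx.x;

    // Load one element per thread; surplus threads contribute zero.
    s[t] = (t < per_block) ? in[blockIdx.x * per_block + t] : 0.0f;
    __syncthreads();

    // Halve the number of active threads each step: log2(blockDim.x) steps in total.
    for (int stride = blockDim.x / 2; stride > 0; stride /= 2) {
        if (t < stride)
            s[t] += s[t + stride];
        __syncthreads();             // outside the if, so every thread reaches it
    }

    if (t == 0)
        out[blockIdx.x] = s[0];      // one result per block
}

// Hypothetical host helper. Assumes in.size() == blocks * per_block
// and 1 <= per_block <= 1024.
std::vector<float> reduce_per_block(const std::vector<float> &in, int blocks, int per_block) {
    int threads = 1;
    while (threads < per_block)
        threads *= 2;                // round up to a power of two

    float *dev_in = nullptr, *dev_out = nullptr;
    cudaMalloc(&dev_in, in.size() * sizeof(float));
    cudaMalloc(&dev_out, blocks * sizeof(float));
    cudaMemcpy(dev_in, in.data(), in.size() * sizeof(float), cudaMemcpyHostToDevice);

    block_reduce<<<blocks, threads, threads * sizeof(float)>>>(dev_in, dev_out, per_block);

    std::vector<float> out(blocks);
    cudaMemcpy(out.data(), dev_out, blocks * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(dev_in);
    cudaFree(dev_out);
    return out;
}
```

Compared with the for-loop above, which needs blockDim.x sequential additions per block, this needs only about log2(blockDim.x) steps because many threads add at once. The presentation mentioned above covers further optimisations that this sketch leaves out.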

exactly, changing the kernel to this worked. Silly mistakes __global__ void add( float * i_data, float * sum){ int tid = blockIdx.x * blockDim.x + threadIdx.x; __shared__ float s_data; s_data = 0; // must be synchronized __syncthreads(); // reduce and sum // typical in GPU computings for (int i = 0; i — Roshan, May 07 '14 at 03:20
