
Relying on NVIDIA's samples and on some good advice found here at SO, I have managed to implement a few array-reduction kernels that I need for my project. However, one particular issue keeps causing me trouble: how to properly do sum-reduction for arrays of unsigned chars (uchar).

Because a uchar can hold values from 0 to 255, the threads of a block obviously can't accumulate a value greater than 255 in a uchar. My intuition was that it would merely be a matter of collecting the sums inside the sum-reduction function in an int, despite the input being uchar. However, it does not work.

Let me show in detail what I have. Below is my kernel to sum-reduce an array of uchar - it is a slightly modified version of the famous reduce6 function from NVIDIA's samples:

template <class T, unsigned int blockSize>
__global__ void reduce6(int n, T *g_idata, int *g_odata)
{
    extern __shared__ T sdata[];

    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockSize * 2 + threadIdx.x;
    unsigned int gridSize = blockSize * 2 * gridDim.x;

    int mySum = 0;

    // we reduce multiple elements per thread.  The number is determined by the
    // number of active thread blocks (via gridDim).  More blocks will result
    // in a larger gridSize and therefore fewer elements per thread
    while (i < n)
    {
        mySum += g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += g_idata[i + blockSize];
        i += gridSize;
    }

    // each thread puts its local sum into shared memory
    sdata[tid] = mySum;
    __syncthreads();


    // do reduction in shared mem
    if ((blockSize >= 512) && (tid < 256))
        sdata[tid] = mySum = mySum + sdata[tid + 256];
    __syncthreads();

    if ((blockSize >= 256) && (tid < 128))
        sdata[tid] = mySum = mySum + sdata[tid + 128];
     __syncthreads();

    if ((blockSize >= 128) && (tid <  64))
        sdata[tid] = mySum = mySum + sdata[tid + 64];
    __syncthreads();

    // fully unroll reduction within a single warp
    if ((blockSize >= 64) && (tid < 32))
        sdata[tid] = mySum = mySum + sdata[tid + 32];
    __syncthreads();

    if ((blockSize >= 32) && (tid < 16))
        sdata[tid] = mySum = mySum + sdata[tid + 16];
    __syncthreads();

    if ((blockSize >= 16) && (tid <  8))
        sdata[tid] = mySum = mySum + sdata[tid + 8];
    __syncthreads();

    if ((blockSize >= 8) && (tid <  4))
        sdata[tid] = mySum = mySum + sdata[tid + 4];
    __syncthreads();

    if ((blockSize >= 4) && (tid <  2))
        sdata[tid] = mySum = mySum + sdata[tid + 2];
    __syncthreads();

    if ((blockSize >= 2) && (tid <  1))
        mySum += sdata[tid + 1];
    __syncthreads();

    // write result for this block to global mem
    if (tid == 0)  atomicAdd(g_odata, mySum);
}

When this kernel is called as reduce6<uchar, Blocksize> such that Blocksize*num.threads = 256, everything works properly and the sum-reduction gets the right result. Whenever that is not the case, the result of the sum-reduction becomes wrong - which is merely due to what I said in the beginning, i.e. uchar can't accumulate values greater than 255.

To me, the intuitive solution would be to simply change the line:

extern __shared__ T sdata[];

To:

extern __shared__ int sdata[];

Since sdata is a shared array created within the sum-reduction kernel, I thought it could be of any type and thus properly accumulate whatever values result from the thread-block summation. To make sure, I even rewrote the while loop with an explicit conversion of the incoming data to int:

    while (i < n)
    {
        mySum += (int)g_idata[i];
        // ensure we don't read out of bounds
        if (i + blockSize < n) mySum += (int)g_idata[i + blockSize];
        i += gridSize;
    }

However, to my surprise, all that only makes the sum-reduction result always come out as zero.

What am I missing? How could I alter this kernel so that the uchar array being passed can be properly sum-reduced with an arbitrary number of thread blocks and threads?

If needed, a full example code can be found at: http://pastebin.com/nq1VRJCs

RAs
  • Add [error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) - your kernel simply does not run for a blocksize larger than 256. – tera Mar 02 '17 at 08:14
  • When you changed the data type used in shared memory, did you also change the formula to calculate the dynamic shared memory size? – tera Mar 02 '17 at 08:14
  • @tera Which formula? If I understand correctly, you are mixing things up. I didn't change the data type that is allocated to the device. I only tried changing the data type of the shared array that is created inside the kernel as a buffer to hold the results of summation before the result is returned. – RAs Mar 02 '17 at 08:30
  • Yes I get that. But if you change the type of the shared memory array, you also need to change the type argument to the `sizeof` operator used to compute its size. – tera Mar 02 '17 at 08:39

2 Answers


The problem is most likely in code you have not shown here:

int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);
reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize>>>
    (DATA_LENGTH, d_data1, d_reduced);

If you have changed the type of the shared memory buffer within the kernel, you must change its size in the kernel call as well.
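
For example (a minimal sketch, keeping the names from the snippet above and assuming the kernel now declares `extern __shared__ int sdata[];`), the size calculation would use the int element type:

int smemSize = (threads <= 256) ?
    2 * threads * sizeof(int) :   // sized for the int elements sdata now holds
    threads * sizeof(int);
reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize>>>
    (DATA_LENGTH, d_data1, d_reduced);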

The reason the results are zero in this case is that the kernel never runs to completion. If you ran the code with cuda-memcheck, or added appropriate runtime API error checking, you would already know this.
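
A minimal sketch of such error checking (the CUDA_CHECK macro name is illustrative, not part of the original code, and assumes <cstdio> and <cstdlib> are included):

#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err = (call);                                     \
        if (err != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",              \
                    cudaGetErrorString(err), __FILE__, __LINE__);     \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// after the kernel launch:
reduce6<uchar, 256> <<< dimGrid, dimBlock, smemSize>>>
    (DATA_LENGTH, d_data1, d_reduced);
CUDA_CHECK(cudaGetLastError());       // catches launch failures (e.g. bad configuration)
CUDA_CHECK(cudaDeviceSynchronize());  // catches errors raised while the kernel runs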

talonmies
  • @tera: NP, I'll upvote yours as well. I am amazed I was able to get this posted at all, I'm sitting on a 737 at 10,000m somewhere over central Sweden at the moment – talonmies Mar 02 '17 at 08:52
  • That is amazing! So the rise of in-flight internet connectivity may increase the quality of SO answers then... – tera Mar 02 '17 at 08:58
  • Haha, that's awesome! Many thanks for the help. Really wish I could accept both answers this time. Also, I will repeat here that I was using error checking, but didn't use cuda-memcheck. Many thanks for pointing that out too. – RAs Mar 02 '17 at 09:04

Add error checking to find that your kernel is not running at all in the cases where the returned sum is zero.

Run your program under cuda-memcheck to find that your kernel is producing out-of-bounds shared memory accesses when you change the type of the shared memory array or when you increase the blocksize beyond 256.

Then see that the size calculation in your full code on pastebin is incorrect for block sizes larger than 256, or when its explicit reference to the type of the shared memory array is not adjusted together with the actual type used in the kernel:

int smemSize = (threads <= 256) ?
    2 * threads * sizeof(uchar) :
    threads * sizeof(uchar);

You have no such case differentiation in the kernel code itself.
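
One possible consistent calculation, as a minimal sketch (assuming the kernel declares `extern __shared__ int sdata[];` and, as in the guarded steps shown in the question, every thread writes sdata[tid] but never indexes past blockSize - 1):

int smemSize = threads * sizeof(int);  // one shared int per thread, no special case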

tera
  • Ah, I see what you mean! I thought that in the comments you were talking about the memory allocation of the passed arrays. Absolutely, now I see the mistake. And yes, as I suspected, once the shared array inside the kernel is set to int, all my problems vanish because it can properly accumulate the results that the `uchar` array could not. All problems solved, thanks! – RAs Mar 02 '17 at 09:01
  • By the way, I was using error checking and found out that the kernel was not being launched. I just described in the question what happened with the data - I'll make sure to be more precise about this next time. In any case, thanks for pointing me to cuda-memcheck. I haven't been using it, but it seems like a life-saver. – RAs Mar 02 '17 at 09:03