Relying on NVIDIA's samples and on some good advice found here at SO, I have been managing to implement a few array-reduction kernels that I need for my project. However, one particular issue remains causing me trouble. It is, how to properly do sum-reduction for arrays of unsigned chars (uchar
).
Because uchar
can hold values from 0 to 255, of course the thread blocks can't accumulate a value greater than 255 per thread block. My intuition was that it would be merely a case of collecting the sums inside the sum-reduction function in an int
despite the input being uchar
. However, it does not work.
Let me show in detail what I have. Below is my kernel to sum-reduce an array of uchar
- it is a slighly modified version of the famous reduce6
function in NVIDIA's samples:
template <class T, unsigned int blockSize>
__global__ void reduce6(int n, T *g_idata, int *g_odata)
{
extern __shared__ T sdata[];
// perform first level of reduction,
// reading from global memory, writing to shared memory
unsigned int tid = threadIdx.x;
unsigned int i = blockIdx.x*blockSize * 2 + threadIdx.x;
unsigned int gridSize = blockSize * 2 * gridDim.x;
int mySum = 0;
// we reduce multiple elements per thread. The number is determined by the
// number of active thread blocks (via gridDim). More blocks will result
// in a larger gridSize and therefore fewer elements per thread
while (i < n)
{
mySum += g_idata[i];
// ensure we don't read out of bounds
if (i + blockSize < n) mySum += g_idata[i + blockSize];
i += gridSize;
}
// each thread puts its local sum into shared memory
sdata[tid] = mySum;
__syncthreads();
// do reduction in shared mem
if ((blockSize >= 512) && (tid < 256))
sdata[tid] = mySum = mySum + sdata[tid + 256];
__syncthreads();
if ((blockSize >= 256) && (tid < 128))
sdata[tid] = mySum = mySum + sdata[tid + 128];
__syncthreads();
if ((blockSize >= 128) && (tid < 64))
sdata[tid] = mySum = mySum + sdata[tid + 64];
__syncthreads();
// fully unroll reduction within a single warp
if ((blockSize >= 64) && (tid < 32))
sdata[tid] = mySum = mySum + sdata[tid + 32];
__syncthreads();
if ((blockSize >= 32) && (tid < 16))
sdata[tid] = mySum = mySum + sdata[tid + 16];
__syncthreads();
if ((blockSize >= 16) && (tid < 8))
sdata[tid] = mySum = mySum + sdata[tid + 8];
__syncthreads();
if ((blockSize >= 8) && (tid < 4))
sdata[tid] = mySum = mySum + sdata[tid + 4];
__syncthreads();
if ((blockSize >= 4) && (tid < 2))
sdata[tid] = mySum = mySum + sdata[tid + 2];
__syncthreads();
if ((blockSize >= 2) && (tid < 1))
mySum += sdata[tid + 1];
__syncthreads();
// write result for this block to global mem
if (tid == 0) atomicAdd(g_odata, mySum);
}
When such kernel is called by using reduce6<uchar, Blocksize>
such that Blocksize*num.threads = 256
, everything works properly and the sum-reduction gets the right result. Whenever such ratio is not 256, the result of the sum-reduction becomes wrong - which is merely due to what I said in the bebinning, i.e. uchar
can't acumulate values greater than 255.
To me, the intuitive solution would be to simply change the line:
extern __shared__ T sdata[];
To:
extern __shared__ int sdata[];
Since sdata
is a shared array created within the sum-reduction kernel, I thought that it could be of any type and thus properly accumulate whatever values result from the thread-block summation. Maybe, to make it sure, I even wrote the while
loop with an explicit conversion of the income data into int
:
while (i < n)
{
mySum += (int)g_idata[i];
// ensure we don't read out of bounds
if (i + blockSize < n) mySum += (int)g_idata[i + blockSize];
i += gridSize;
}
However, to my surprise, all tha only makes the sum-reduction result to be always zero.
What am I missing? How could I alter such kernel to make it so that the uchar
array being passed can be properly sum-reduced with arbitrary number of thread-blocks and threads?
If needed, a full example code can be found at: http://pastebin.com/nq1VRJCs