I had the idea of a warp-based parallel reduction, since all threads of a warp are in sync by definition.
So the idea was that the input data can be reduced by a factor of 64 per warp (32 threads per warp, two elements per thread) without any need for synchronization.
As in the original implementation by Mark Harris, the reduction is applied at block level and the data resides in shared memory: http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf
I created a kernel to test his version and my warp-based version.
The kernel itself is completely identical for both, storing BLOCK_SIZE elements in shared memory and writing its result to its unique block index in an output array.
The algorithm itself works fine. I tested it with an array full of ones to verify the counting.
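Since the surrounding test kernel is not shown below, here is a minimal sketch of what it looks like (the kernel name reduceAddKernel_f, BLOCK_SIZE and the 2D grid indexing are my assumptions, not the original code):

#define BLOCK_SIZE 256   // matches the tested configuration of 256 threads per block

__global__ void reduceAddKernel_f(const float* inData, float* outBlockResults)
{
    __shared__ float sharedValues[BLOCK_SIZE];

    // linearized block index (a 2D grid is assumed, since the number of blocks
    // exceeds the 65535 gridDim.x limit of compute 1.x devices)
    const unsigned int blockId   = blockIdx.y * gridDim.x + blockIdx.x;
    const unsigned int globalIdx = blockId * blockDim.x + threadIdx.x;

    // each thread loads one element into shared memory
    sharedValues[threadIdx.x] = inData[globalIdx];
    __syncthreads();

    // block-level reduction; thread 0 writes the block result
    reductionAddBlockThread_f(sharedValues, outBlockResults[blockId]);
}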
Function body of the implementations:
/**
 * Performs a parallel reduction with operator add
 * on the given array and writes the result with thread 0
 * to the given target variable
 *
 * @param inValues     float*  Input float array, length must be a power of 2 and equal to blockDim.x
 * @param outTargetVar float&  Target variable for the final result
 */
__device__ void reductionAddBlockThread_f(float* inValues,
                                          float &outTargetVar)
{
    // code of the implementations below
}
1. Implementation of Harris's version:
if (blockDim.x >= 1024 && threadIdx.x < 512)
    inValues[threadIdx.x] += inValues[threadIdx.x + 512];
__syncthreads();

if (blockDim.x >= 512 && threadIdx.x < 256)
    inValues[threadIdx.x] += inValues[threadIdx.x + 256];
__syncthreads();

if (blockDim.x >= 256 && threadIdx.x < 128)
    inValues[threadIdx.x] += inValues[threadIdx.x + 128];
__syncthreads();

if (blockDim.x >= 128 && threadIdx.x < 64)
    inValues[threadIdx.x] += inValues[threadIdx.x + 64];
__syncthreads();

// unroll last warp, no sync needed
if (threadIdx.x < 32)
{
    if (blockDim.x >= 64) inValues[threadIdx.x] += inValues[threadIdx.x + 32];
    if (blockDim.x >= 32) inValues[threadIdx.x] += inValues[threadIdx.x + 16];
    if (blockDim.x >= 16) inValues[threadIdx.x] += inValues[threadIdx.x + 8];
    if (blockDim.x >=  8) inValues[threadIdx.x] += inValues[threadIdx.x + 4];
    if (blockDim.x >=  4) inValues[threadIdx.x] += inValues[threadIdx.x + 2];
    if (blockDim.x >=  2) inValues[threadIdx.x] += inValues[threadIdx.x + 1];

    // set final value
    if (threadIdx.x == 0)
        outTargetVar = inValues[0];
}
Resources:
4 __syncthreads() used
12 if statements used
11 read + add + write operations
1 final write operation
5 registers used
Performance:
average of five test runs: ~19.54 ms
2. Warp-based approach (same function body as above):
/*
 * Perform the first, warp-based reduction by a factor of 64
 *
 * 32 threads per warp -> log2(32) = 5
 *
 * 1024 threads / 32 threads per warp = 32 warps
 * 2 elements handled per thread -> 32 * 2 = 64 elements per warp
 *
 * 1024 threads/elements divided by 64 = 16 warp results
 *
 * Only half the warps/threads are active
 */
if (threadIdx.x < (blockDim.x >> 1))
{
    const unsigned int warpId = threadIdx.x >> 5;
    // alternative: threadIdx.x & 31
    const unsigned int threadWarpId = threadIdx.x - (warpId << 5);
    const unsigned int threadWarpOffset = (warpId << 6) + threadWarpId;

    inValues[threadWarpOffset] += inValues[threadWarpOffset + 32];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 16];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 8];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 4];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 2];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 1];
}
// synchronize all warps - each local warp result is now stored
// at the index of that warp's first thread
__syncthreads();

// use the first warp to reduce the 16 warp results to the final one
if (threadIdx.x < 8)
{
    // first element of a warp's 64-element chunk
    const unsigned int warpIdx = threadIdx.x << 6;

    if (blockDim.x >= 1024) inValues[warpIdx] += inValues[warpIdx + 512];
    if (blockDim.x >=  512) inValues[warpIdx] += inValues[warpIdx + 256];
    if (blockDim.x >=  256) inValues[warpIdx] += inValues[warpIdx + 128];
    if (blockDim.x >=  128) inValues[warpIdx] += inValues[warpIdx + 64];

    // set final value
    if (threadIdx.x == 0)
        outTargetVar = inValues[0];
}
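To illustrate the index arithmetic of the first stage, here is a tiny host-side helper (purely illustrative, not part of the kernel) that prints, for a few threads, which 64-element chunk they work on, assuming blockDim.x = 1024:

#include <stdio.h>

int main(void)
{
    const unsigned int blockDimX = 1024;  /* assumed block size */

    /* only the first half of the threads is active in the first stage */
    for (unsigned int threadIdxX = 0; threadIdxX < blockDimX >> 1; threadIdxX += 100)
    {
        const unsigned int warpId           = threadIdxX >> 5;              /* threadIdx.x / 32 */
        const unsigned int threadWarpId     = threadIdxX - (warpId << 5);   /* lane id, same as threadIdx.x & 31 */
        const unsigned int threadWarpOffset = (warpId << 6) + threadWarpId; /* 64 * warpId + lane */

        printf("thread %3u -> warp %2u, lane %2u, starts at offset %3u\n",
               threadIdxX, warpId, threadWarpId, threadWarpOffset);
    }
    return 0;
}

Each active warp therefore owns a contiguous 64-element chunk starting at 64 * warpId, so its partial result ends up at that chunk's first index, which is exactly what the second stage reads via threadIdx.x << 6.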
Resources:
1 __syncthreads() used
7 if statements
10 read + add + write operations
1 final write operation
5 registers used
5 bit shifts
1 add
1 sub
Performance:
average of five test runs: ~20.82 ms
I tested both kernels multiple times on a GeForce 8800 GT (512 MB) with 256 MB of float values, running the kernels with 256 threads per block (100 % occupancy).
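For reference, the timings are per kernel launch, measured with a cudaEvent-based harness along these lines (a sketch only, reusing the assumed reduceAddKernel_f name from the sketch above; device pointer names and the grid layout are my assumptions):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    // 256 MB of floats = 64M elements, 256 threads per block -> 262144 blocks,
    // laid out as a 512 x 512 grid to respect the 65535 gridDim.x limit of
    // compute 1.x devices such as the 8800 GT
    const size_t numElements = 256u * 1024u * 1024u / sizeof(float);
    const dim3   gridSize(512, 512);
    const dim3   blockSize(256);

    float* d_inData          = 0;
    float* d_outBlockResults = 0;
    cudaMalloc((void**)&d_inData, numElements * sizeof(float));
    cudaMalloc((void**)&d_outBlockResults, gridSize.x * gridSize.y * sizeof(float));
    // ... fill d_inData, e.g. with ones for the counting test ...

    cudaEvent_t start, stop;
    cudaEventCreate(&start);
    cudaEventCreate(&stop);

    cudaEventRecord(start, 0);
    reduceAddKernel_f<<<gridSize, blockSize>>>(d_inData, d_outBlockResults);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    float elapsedMs = 0.0f;
    cudaEventElapsedTime(&elapsedMs, start, stop);   // one run in milliseconds
    printf("kernel time: %.2f ms\n", elapsedMs);

    cudaEventDestroy(start);
    cudaEventDestroy(stop);
    cudaFree(d_inData);
    cudaFree(d_outBlockResults);
    return 0;
}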
The warp-based version is ~1.28 ms slower.
If future cards allow larger block sizes, the warp-based approach would still need no further sync statement, since the maximum of 4096 elements is reduced to 64 partial results (4096 / 64 = 64), which is exactly what the final warp can then reduce to 1.
Why is it not faster? Or where is the flaw in the idea or the kernel?
Judging from the resource usage, shouldn't the warp-based approach be ahead?
Edit 1: Corrected the kernel so that only half the threads are active and no out-of-bounds reads occur; added new performance data.