
I had the idea of a warp-based parallel reduction, since all threads of a warp are in sync by definition.

So the idea was that the input data can be reduced by a factor of 64 (each thread reduces two elements) without any need for synchronization.

As in the original implementation by Mark Harris, the reduction is applied at block level and the data resides in shared memory: http://gpgpu.org/static/sc2007/SC07_CUDA_5_Optimization_Harris.pdf

I created a kernel to test his version and my warp-based version.
The kernel itself is identical for both: it stores BLOCK_SIZE elements in shared memory and writes its result to its unique block index in an output array.

The algorithm itself works fine. I tested it with an array full of ones to verify the counting.
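
For reference, the surrounding test kernel looks roughly like this (a minimal sketch, not my exact code; the name reduceKernel and the fixed BLOCK_SIZE are illustrative):

#define BLOCK_SIZE 256

// each block loads BLOCK_SIZE floats into shared memory, reduces them,
// and writes one partial sum per block into the output array;
// launched with BLOCK_SIZE threads per block
__global__ void reduceKernel(const float* inData, float* outData)
{
    __shared__ float sdata[BLOCK_SIZE];
    sdata[threadIdx.x] = inData[blockIdx.x * blockDim.x + threadIdx.x];
    __syncthreads();

    reductionAddBlockThread_f(sdata, outData[blockIdx.x]);
}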

Function body of the implementations:

/**
 * Performs a parallel reduction with operator add
 * on the given array and writes the result with thread 0
 * to the given target variable
 *
 * @param inValues float* input array in shared memory; length must be
 *                 a power of 2 and equal to blockDim.x
 * @param outTargetVar float& variable the final sum is written to
 */
__device__ void reductionAddBlockThread_f(float* inValues,
    float &outTargetVar)
{
    // code of the below functions
}

1. Implementation of his version:

if (blockDim.x >= 1024 && threadIdx.x < 512)
    inValues[threadIdx.x] += inValues[threadIdx.x + 512];
__syncthreads();
if (blockDim.x >= 512 && threadIdx.x < 256)
    inValues[threadIdx.x] += inValues[threadIdx.x + 256];
__syncthreads();
if (blockDim.x >= 256 && threadIdx.x < 128)
    inValues[threadIdx.x] += inValues[threadIdx.x + 128];
__syncthreads();
if (blockDim.x >= 128 && threadIdx.x < 64)
    inValues[threadIdx.x] += inValues[threadIdx.x + 64];
__syncthreads();

// unroll last warp, no sync needed (threads of a warp run in lockstep)
if (threadIdx.x < 32)
{
    if (blockDim.x >= 64) inValues[threadIdx.x] += inValues[threadIdx.x + 32];
    if (blockDim.x >= 32) inValues[threadIdx.x] += inValues[threadIdx.x + 16];
    if (blockDim.x >= 16) inValues[threadIdx.x] += inValues[threadIdx.x + 8];
    if (blockDim.x >= 8) inValues[threadIdx.x] += inValues[threadIdx.x + 4];
    if (blockDim.x >= 4) inValues[threadIdx.x] += inValues[threadIdx.x + 2];
    if (blockDim.x >= 2) inValues[threadIdx.x] += inValues[threadIdx.x + 1];

    //set final value
    if (threadIdx.x == 0)
        outTargetVar = inValues[0];
}

Resources:

4 __syncthreads() calls
12 if statements
11 read + add + write operations
1 final write operation
5 registers used

Performance:

Average of five test runs: ~19.54 ms

2. Warp-based approach (same function body as above):

/*
 * Perform the first warp-based reduction by a factor of 64
 *
 * 32 threads per warp -> LOG2(32) = 5
 *
 * 1024 threads / 32 threads per warp = 32 warps
 * 2 elements handled per thread -> 32 * 2 = 64 elements per warp
 *
 * 1024 threads/elements divided by 64 = 16 partial results
 *
 * Only half the warps/threads are active
 */
if (threadIdx.x < blockDim.x >> 1)
{
    const unsigned int warpId = threadIdx.x >> 5;
    // alternative: threadIdx.x & 31
    const unsigned int threadWarpId = threadIdx.x - (warpId << 5);
    const unsigned int threadWarpOffset = (warpId << 6) + threadWarpId;
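    // e.g. thread 37: warpId = 1, threadWarpId = 5, threadWarpOffset = 69,
    // so warp 1 reduces the 64-element window [64, 128)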

    inValues[threadWarpOffset] += inValues[threadWarpOffset + 32];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 16];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 8];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 4];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 2];
    inValues[threadWarpOffset] += inValues[threadWarpOffset + 1];
}

// synchronize all warps - each warp's local result is now stored
// at the index of its first thread
__syncthreads();

// use the first 8 threads to reduce the 16 warp results to the final one
if (threadIdx.x < 8)
{
    // first element of each warp's 64-element window
    const unsigned int warpIdx = threadIdx.x << 6;

    if (blockDim.x >= 1024) inValues[warpIdx] += inValues[warpIdx + 512];
    if (blockDim.x >= 512) inValues[warpIdx] += inValues[warpIdx + 256];
    if (blockDim.x >= 256) inValues[warpIdx] += inValues[warpIdx + 128];
    if (blockDim.x >= 128) inValues[warpIdx] += inValues[warpIdx + 64];

    //set final value
    if (threadIdx.x == 0)
        outTargetVar = inValues[0];
}

Resources:

1 __syncthreads() call
7 if statements
10 read + add + write operations
1 final write operation
5 registers used

5 bit shifts
1 add
1 sub

Performance:

Average of five test runs: ~20.82 ms

I tested both kernels multiple times on a GeForce 8800 GT (512 MB) with 256 MB of float values, running the kernels with 256 threads per block (100% occupancy).

The warp-based version is ~1.28 ms slower.

If future cards allow larger block sizes, the warp-based approach would still need no further sync statement, since the maximum of 4096 elements gets reduced to 64, which the final warp then reduces to 1.

Why is it not faster? Where is the flaw in the idea or the kernel?

Judging by resource usage, shouldn't the warp-based approach be ahead?

Edit 1: Corrected the kernel so that only half the threads are active, which avoids the out-of-bounds reads; added new performance data.

djmj

2 Answers


I think the reason your code is slower than mine is that in my code, half as many warps are active for each ADD in the first phase. In your code, all warps are active for all of the first phase. So overall your code executes more warp instructions. In CUDA it's important to consider total "warp instructions" executed, not just the number of instructions executed by one warp.
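
To make that concrete, here is a rough tally of the shared-memory add steps for blockDim.x = 1024, counting active warps per step (my estimate from the two listings in the question, ignoring index arithmetic and the final write):

my code:   16 + 8 + 4 + 2 warps (synced steps) + 1 warp * 6 steps (last warp)    =  36 warp-adds
your code: 16 warps * 6 steps (first stage)    + 1 warp * 4 steps (second stage) = 100 warp-adds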

Also, there's no point in only using half of your warps. There is overhead in launching the warps only to have them evaluate two branches and exit.

Another thought is that the use of unsigned char and short might actually be costing you performance. I'm not sure, but it's certainly not saving you registers since they are not packed into single 32-bit variables.

Also, in my original code, I replaced blockDim.x with a template parameter, BLOCKDIM, which means that it only used 5 run-time if statements (the ifs in the second stage are eliminated by the compiler).
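
As a sketch of what that looks like (illustrative, not my exact kernel code; the same pattern is in the slides linked in your question):

template <unsigned int BLOCKDIM>
__device__ void reductionAddTemplated(volatile float* inValues,
    float &outTargetVar)
{
    // BLOCKDIM is a compile-time constant, so every comparison against it
    // is resolved by the compiler and the dead branches disappear
    if (BLOCKDIM >= 1024 && threadIdx.x < 512)
        inValues[threadIdx.x] += inValues[threadIdx.x + 512];
    __syncthreads();
    if (BLOCKDIM >= 512 && threadIdx.x < 256)
        inValues[threadIdx.x] += inValues[threadIdx.x + 256];
    __syncthreads();
    if (BLOCKDIM >= 256 && threadIdx.x < 128)
        inValues[threadIdx.x] += inValues[threadIdx.x + 128];
    __syncthreads();
    if (BLOCKDIM >= 128 && threadIdx.x < 64)
        inValues[threadIdx.x] += inValues[threadIdx.x + 64];
    __syncthreads();

    if (threadIdx.x < 32)
    {
        // warp-synchronous part; only the threadIdx.x tests remain at run time
        if (BLOCKDIM >= 64) inValues[threadIdx.x] += inValues[threadIdx.x + 32];
        if (BLOCKDIM >= 32) inValues[threadIdx.x] += inValues[threadIdx.x + 16];
        if (BLOCKDIM >= 16) inValues[threadIdx.x] += inValues[threadIdx.x + 8];
        if (BLOCKDIM >= 8)  inValues[threadIdx.x] += inValues[threadIdx.x + 4];
        if (BLOCKDIM >= 4)  inValues[threadIdx.x] += inValues[threadIdx.x + 2];
        if (BLOCKDIM >= 2)  inValues[threadIdx.x] += inValues[threadIdx.x + 1];

        if (threadIdx.x == 0)
            outTargetVar = inValues[0];
    }
}

// called with the block size fixed at compile time, e.g.
// reductionAddTemplated<256>(sdata, outData[blockIdx.x]);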

BTW, a cheaper way to compute your threadWarpId is

const int threadWarpId = threadIdx.x & 31;

You might check this article for more ideas.

EDIT: Here's an alternative warp-based block reduction.

template <typename T, int level>
__device__
void sumReduceWarp(volatile T *sdata, const unsigned int tid)
{
  T t = sdata[tid];
  if (level > 5) sdata[tid] = t = t + sdata[tid + 32];
  if (level > 4) sdata[tid] = t = t + sdata[tid + 16];
  if (level > 3) sdata[tid] = t = t + sdata[tid +  8];
  if (level > 2) sdata[tid] = t = t + sdata[tid +  4];
  if (level > 1) sdata[tid] = t = t + sdata[tid +  2];
  if (level > 0) sdata[tid] = t = t + sdata[tid +  1];
}

template <typename T>
__device__
void sumReduceBlock(T *output, volatile T *sdata)
{
  // sdata is a shared array of length 2 * blockDim.x

  const unsigned int warp = threadIdx.x >> 5;
  const unsigned int lane = threadIdx.x & 31;
  const unsigned int tid  = (warp << 6) + lane;

  sumReduceWarp<T, 6>(sdata, tid);
  __syncthreads();

  // lane 0 of each warp now contains the sum of two warps' values
  if (lane == 0) sdata[warp] = sdata[tid];

  __syncthreads();

  if (warp == 0) {
    sumReduceWarp<T, 5>(sdata, threadIdx.x);
    if (lane == 0) *output = sdata[0];
  }
}

This should be a bit faster because it uses all the warps that are launched in the first stage, and has no branching within the last stage, at the cost of an extra branch, shared load/store and __syncthreads() in the new middle stage. I haven't tested this code. If you run it, let me know how it performs. If you use a template for the blockDim in your original code it may again be faster, but I think this code is more succinct.

Note the temporary variable t is used because Fermi and later architectures use a pure load/store architecture, so += from shared memory to shared memory results in an extra load (since the sdata pointer must be volatile). Explicitly loading into the temporary once avoids this. On G80 it won't make a difference to performance.
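
For clarity, a hypothetical caller would look something like this (the kernel name and launch configuration are my own; like the level constants above, it assumes 1024 threads per block):

// each block of 1024 threads consumes 2 * blockDim.x input floats
// and emits one partial sum
__global__ void blockSumKernel(const float *in, float *blockSums)
{
    extern __shared__ float sdata[];  // 2 * blockDim.x floats, sized at launch

    const unsigned int base = 2 * blockIdx.x * blockDim.x;
    sdata[threadIdx.x]              = in[base + threadIdx.x];
    sdata[threadIdx.x + blockDim.x] = in[base + threadIdx.x + blockDim.x];
    __syncthreads();

    sumReduceBlock(&blockSums[blockIdx.x], sdata);
}

// launch: blockSumKernel<<<numBlocks, 1024, 2 * 1024 * sizeof(float)>>>(d_in, d_sums);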

harrism
  • I must apologize for having said that the shared memory array length is equal to BLOCK_SIZE; it must either be double the block size (which makes no sense when initializing from global memory) or only half the threads should be active. And since all threads were active, there were out-of-bounds reads and writes. I will correct this and check it again. The corrected kernel now is definitely correct; it always produces mathematically correct results, as I stated. The strange thing is why these shared memory out-of-bounds accesses did not create wrong results. – djmj Oct 05 '12 at 00:40
  • About the BLOCKDIM template parameter: that's why I posted my implementation of your algorithm. And I still need the more expensive way to compute `warpId`, since I need it to calculate `threadWarpOffset`. – djmj Oct 05 '12 at 00:40
  • Sorry, I missed the 6 rather than 5 in the third line. Edited. I think my reasoning about why your code is slower is likely to be at least part of the reason. – harrism Oct 05 '12 at 00:48
  • With the first, wrong kernel where all threads participated, the `total warp instructions` were definitely higher. But why should this be relevant if only half of the threads are active and the rest are waiting for the next sync statement? Instead of waiting, couldn't they also do something, especially since the threads within a warp are in sync? – djmj Oct 05 '12 at 00:54
  • Idle warps do not use processing cycles. Active warps do. Your code overall has more active warps for more instructions, so total cycles is higher. My code does the same computation in less total cycles. I think that a warp-wise reduction is doable (have done it myself), but I would do it differently. I'll try to dig up some code... – harrism Oct 05 '12 at 00:56
  • Hmm, I think I am slowly getting it. So I have to calculate the real number of instructions executed. Oh man, it seemed so straightforward; there is never one obvious way to do it in CUDA. About the unsigned short and unsigned char: I replaced both with int and it resulted in one register less??? You think you program all your kernels correctly and save resources, and then this. Is a vec2 of short packed into a 32-bit register? – djmj Oct 05 '12 at 01:28
  • I would still like to see your warp-wise reduction. Just hoping for a block size of 4096 in future CUDA releases to outperform your version ;) – djmj Oct 05 '12 at 02:07
  • Bigger blocks aren't necessarily going to help. You want enough parallelism to fill the machine and hide memory latency, and no more. That's why the fast algorithm does iterative summation within each thread until the total number of partial sums is smaller than a thread block can sum with a tree (see the sketch after these comments). You don't want a tree bigger than you have to have, because it is inefficient. – harrism Oct 05 '12 at 03:03
  • Added an example warp reduction. This is very similar to the scan used in CUDPP. – harrism Oct 05 '12 at 03:34
  • Thanks. I also had the idea of `if (lane == 0) sdata[warp] = sdata[tid];` to store the warp results at the warp index. (Is my strided indexing in shared memory relevant? Is there nothing like coalesced shared memory access?) But I thought using `warpIdx = threadIdx.x << 6;` would save this if condition and sync. In your last reduction all 32 threads participate, which should not be necessary since at most 8 threads are needed. But this, and your removed check against BLOCKDIM, make it more scalable for future devices, though it should cost performance compared to mine. Will check your performance. – djmj Oct 05 '12 at 15:07
  • You can't have fewer than 32 active threads, or you are leaving hardware idle, so there's no reason to use only 8 threads. Good point about strided indexing: your version could have 8-way shared memory bank conflicts, which is pretty bad. No bank conflicts in mine. – harrism Oct 06 '12 at 01:15
  • Thomas, aren't you 'cheating' by letting yourself assume that `blockDim.x` is 1024? Rather than using it as a variable, I mean? Anyway, I edited the code to reflect that assumption more clearly. – einpoklum Dec 02 '13 at 06:18
  • Also, you must use those 'if' statements during those +32, +16, +8, +4, +2, +1 operations. Otherwise, you may get a wrong result. E.g. on the +16 step it can occasionally add 32nd element into 16th element prior to 16th element being added into 0th. – Olex Mar 03 '17 at 21:52
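
A minimal sketch of the iterative per-thread summation described in the comments above (the kernel name and the grid-stride loop are my own illustration of the pattern; the tree stage is the standard synced reduction):

// each thread serially accumulates many input elements before the
// block-level tree reduction, keeping the tree small
__global__ void sumKernel(const float *in, float *blockSums, unsigned int n)
{
    extern __shared__ float sdata[];  // blockDim.x floats

    float sum = 0.0f;
    // grid-stride loop over the input
    for (unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
         i < n;
         i += gridDim.x * blockDim.x)
        sum += in[i];

    sdata[threadIdx.x] = sum;
    __syncthreads();

    // standard tree reduction over the per-thread partial sums
    for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1)
    {
        if (threadIdx.x < s)
            sdata[threadIdx.x] += sdata[threadIdx.x + s];
        __syncthreads();
    }

    if (threadIdx.x == 0)
        blockSums[blockIdx.x] = sdata[0];
}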

You should also check the examples in the SDK. I remember one very nice example with implementations of several different reduction strategies. At least one of those also uses warp-based reduction.

(I can't look up the name right now, because I only have it installed on my other machine.)

Michael