I found this PDF (http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf) that walks through several ways to optimize a reduce operation in CUDA, and I'm trying to follow along. For reduction #5, it suggests unrolling the last 6 iterations of the loop with the following code:
if (tid < 32)
{
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}
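For context, here is my reconstruction of the surrounding reduce5 kernel from the PDF (the function and variable names come from the slides; any details I misremembered are my own), with the unrolled block inlined at the end:

__global__ void reduce5(int *g_idata, int *g_odata)
{
    extern __shared__ int sdata[];

    // Each thread loads two elements and does the first add during the load
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * (blockDim.x * 2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i + blockDim.x];
    __syncthreads();

    // Tree reduction in shared memory, stopping once a single warp remains
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1)
    {
        if (tid < s)
            sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // The unrolled last 6 iterations from above, with no barriers
    if (tid < 32)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }

    // Thread 0 writes this block's partial sum
    if (tid == 0)
        g_odata[blockIdx.x] = sdata[0];
}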
The previous slide even says:
- As reduction proceeds, # “active” threads decreases
- When s <= 32, we have only one warp left
- Instructions are SIMD synchronous within a warp
- That means when s <= 32:
  - We don’t need to __syncthreads()
  - We don’t need “if (tid < s)” because it doesn’t save any work
However, when I tried this approach I got a MUCH smaller sum from the reduction than with the previous approach. If I add a __syncthreads() after each write to shared memory, I get the correct result (see the sketch below).
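Concretely, this variant gives me the correct answer (a sketch of what I mean; I hoisted each __syncthreads() out of the guard so that every thread in the block reaches the barrier, since a __syncthreads() inside if (tid < 32) would be a divergent barrier):

// Replaces the unrolled block above: same adds, but with a full
// block barrier between steps so each write is visible before the
// next read.
if (tid < 32) sdata[tid] += sdata[tid + 32];
__syncthreads();
if (tid < 32) sdata[tid] += sdata[tid + 16];
__syncthreads();
if (tid < 32) sdata[tid] += sdata[tid + 8];
__syncthreads();
if (tid < 32) sdata[tid] += sdata[tid + 4];
__syncthreads();
if (tid < 32) sdata[tid] += sdata[tid + 2];
__syncthreads();
if (tid < 32) sdata[tid] += sdata[tid + 1];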
Are the claims that "Instructions are SIMD synchronous within a warp" and that "We don't need to __syncthreads()" not true? Or is this an old document, and has the technology changed since it was written?