
I found this PDF (http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf) that walks you through several ways to optimize a reduce operation in CUDA and I'm trying to follow along. For reduction #5 it suggests unrolling the last 6 iterations of the loop with the following code:

if (tid < 32)
{
  sdata[tid] += sdata[tid + 32];
  sdata[tid] += sdata[tid + 16];
  sdata[tid] += sdata[tid + 8];
  sdata[tid] += sdata[tid + 4];
  sdata[tid] += sdata[tid + 2];
  sdata[tid] += sdata[tid + 1];
}

The previous slide even says:

  • As reduction proceeds, # “active” threads decreases
    • When s <= 32, we have only one warp left
  • Instructions are SIMD synchronous within a warp
  • That means when s <= 32:
    • We don’t need to __syncthreads()
    • We don’t need “if (tid < s)” because it doesn’t save any work

However, when I tried this approach, I got a MUCH smaller sum from the reduction than with the previous approach. If I add __syncthreads() after each write to shared memory, I get the correct result.
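
Something like this, as a sketch of what I mean (with the barrier outside the if so every thread in the block reaches it):

// Sketch of the workaround: keep a per-step barrier for the last warp too.
for (unsigned int s = 32; s > 0; s >>= 1)
{
    if (tid < s)
        sdata[tid] += sdata[tid + s];
    __syncthreads();  // reached by all threads, not just tid < 32
}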

Are the comments about "Instructions are SIMD synchronous within a warp" and "We don't need to __syncthreads()" not true? Or is this an old document and has the technology changed since it was written?

Moohasha

1 Answer


You need to use the `volatile` keyword on the shared-memory pointer, as njuffa commented below.
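
Applied to the snippet in your question, the minimal change is to read and write the shared memory through a volatile pointer so the compiler can't keep the intermediate sums in registers (a sketch, assuming sdata is declared as int):

if (tid < 32)
{
    volatile int *smem = sdata;  // volatile: every read/write really goes to shared memory
    smem[tid] += smem[tid + 32];
    smem[tid] += smem[tid + 16];
    smem[tid] += smem[tid + 8];
    smem[tid] += smem[tid + 4];
    smem[tid] += smem[tid + 2];
    smem[tid] += smem[tid + 1];
}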

There's a much more recent version of the same document here: https://developer.download.nvidia.com/assets/cuda/files/reduction.pdf

Here's the equivalent kernel from example #6 for reference:

// Unrolled reduction of the final warp. The volatile qualifier forces each
// read and write of sdata to hit shared memory, so every step sees the
// result of the previous one without needing __syncthreads().
template <unsigned int blockSize>
__device__ void warpReduce(volatile int *sdata, unsigned int tid) {
    if (blockSize >= 64) sdata[tid] += sdata[tid + 32];
    if (blockSize >= 32) sdata[tid] += sdata[tid + 16];
    if (blockSize >= 16) sdata[tid] += sdata[tid + 8];
    if (blockSize >= 8) sdata[tid] += sdata[tid + 4];
    if (blockSize >= 4) sdata[tid] += sdata[tid + 2];
    if (blockSize >= 2) sdata[tid] += sdata[tid + 1];
}

template <unsigned int blockSize>
__global__ void reduce6(int *g_idata, int *g_odata, unsigned int n) {
    extern __shared__ int sdata[];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockSize*2) + tid;
    unsigned int gridSize = blockSize*2*gridDim.x;
    sdata[tid] = 0;

    // Grid-stride loop: each thread sums pairs of elements blockSize apart.
    // Note: as written this assumes n is a multiple of 2*blockSize; otherwise
    // g_idata[i+blockSize] needs a bounds check.
    while (i < n) { sdata[tid] += g_idata[i] + g_idata[i+blockSize]; i += gridSize; }

    __syncthreads();

    if (blockSize >= 512) { if (tid < 256) { sdata[tid] += sdata[tid + 256]; } __syncthreads(); }
    if (blockSize >= 256) { if (tid < 128) { sdata[tid] += sdata[tid + 128]; } __syncthreads(); }
    if (blockSize >= 128) { if (tid < 64) { sdata[tid] += sdata[tid + 64]; } __syncthreads(); }

    if (tid < 32) warpReduce<blockSize>(sdata, tid);  // blockSize can't be deduced, so pass it explicitly
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
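
For completeness, a launch would look something like this (a sketch; the names and sizes here are illustrative, the block size has to match the template parameter, and the dynamic shared memory must cover one int per thread):

const unsigned int threads = 256;
dim3 dimBlock(threads, 1, 1);
dim3 dimGrid(64, 1, 1);                    // number of blocks, chosen by the caller
size_t smemSize = threads * sizeof(int);   // dynamic shared memory for sdata[]
reduce6<threads><<<dimGrid, dimBlock, smemSize>>>(d_idata, d_odata, n);
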
Jonathan Olson
  • Unless `sdata` has the `volatile` attribute, this code is unlikely to work as desired. There is an *implicit* data dependency between the statements in the `if`-block that the compiler cannot detect, and it will likely happily pull all the `sdata` reads to the start of the block to increase the latency tolerance of the code. I would suggest using up-to-date documentation for CUDA, rather than an outdated one. – njuffa Jul 26 '17 at 19:23
  • @Jonathan Olson I implemented reduce6 as well and had the same problem unless I called __syncthreads() after each write to sdata. – Moohasha Jul 26 '17 at 19:27
  • @njuffa Adding `volatile` to sdata appears to have fixed it! I don't know how old that documentation is, I just ran across it when looking for ways to optimize a reduce algorithm. – Moohasha Jul 26 '17 at 19:28
  • @Moohasha This document seems to be from CUDA 1.1 beta, so likely dates to around late 2007. – njuffa Jul 26 '17 at 19:34
  • thanks njuffa, edited the above – Jonathan Olson Jul 26 '17 at 21:00
  • Thanks for the link to the updated version. I didn't see a date or version number in the PDF that I found and had no idea it was so old! – Moohasha Jul 27 '17 at 18:30