
The following code sums every group of 32 elements in an array into the first element of that group:

int i = threadIdx.x;
int warpid = i & 31;
if (warpid < 16) {
    s_buf[i] += s_buf[i + 16]; __syncthreads();
    s_buf[i] += s_buf[i + 8];  __syncthreads();
    s_buf[i] += s_buf[i + 4];  __syncthreads();
    s_buf[i] += s_buf[i + 2];  __syncthreads();
    s_buf[i] += s_buf[i + 1];  __syncthreads();
}

I thought I could eliminate all the __syncthreads() calls, since all the operations are performed within the same warp. But if I eliminate them, I get garbage results back. Keeping them should not affect performance much, but I want to know why I need __syncthreads() here.

small_potato
  • yes, it's a Quadro 6000, and I am using CUDA4.0. In fact, I have used similar technique on a GTX 580. I was surprised this doesn't work without __syncthreads() – small_potato May 23 '12 at 23:46
  • You do realise that `threadIdx.x & 31` isn't the warp number and `(threadIdx.x & 31) < 16` doesn't select threads within the same warp? – talonmies May 24 '12 at 00:10
  • I might be getting something wrong here. Doesn't (threadIdx.x & 31) < 16 select the first 16 threads of each warp? – small_potato May 24 '12 at 01:31
  • @talonmies I think "warpid" is the thread index inside that warp [0-31] – djmj May 24 '12 at 02:47
  • @small_potato: Yes it is the first 16 threads of each warp, but unless your intention is to produce a reduction sum *per warp* (so multiple sums per block), then I don't see how this helps you. But the main problem is probably how `s_buf` has been declared. Have you declared it `volatile`? – talonmies May 24 '12 at 05:27
  • @small_potato Could you provide full code of this reduction usage? – geek May 24 '12 at 08:56
  • @talonmies: If I declare the shared memory as volatile, it actually works. I always thought the threads in the same warp see the same result in shared memory. Apparently, I was wrong. – small_potato May 25 '12 at 18:10
  • @small_potato: this is documented in the Fermi programming guide. Fermi cards don't have specific instructions to operate on shared memory. If compiler optimisation results in a value being held in register instead of written back to shared memory, the implicit synchronisation between threads in a warp can be broken. Declaring the shared memory `volatile` eliminates the problem. – talonmies May 25 '12 at 18:43
  • BTW, `threadIdx.x & 31` is the lane ID, not the warp ID. Your non-standard naming is what confused talonmies, I suspect. The calculation for warp ID is `threadIdx.x / warpSize` (or `threadIdx.x >> LOG2_WARP_SIZE` to be more efficient). – harrism May 28 '12 at 00:06

2 Answers

I'm providing an answer here because I think the existing answers are not fully satisfactory. The "intellectual property" of this answer belongs to Mark Harris, who pointed out this issue in this presentation (slide 22), and to @talonmies, who pointed the problem out to the OP in the comments above.

Let me first try to summarize what the OP was asking, filtering out his mistakes.

The OP seems to be dealing with the last step of a shared-memory reduction, namely warp reduction by loop unrolling. He is doing something like this:

template <class T>
__device__ void warpReduce(T *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

template <class T>
__global__ void reduce4_no_synchthreads(T *g_idata, T *g_odata, unsigned int N)
{
    extern __shared__ T sdata[];

    unsigned int tid    = threadIdx.x;                              // Local thread index
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;       // Global thread index - Fictitiously double the block dimension

    // --- Performs the first level of reduction in registers when reading from global memory. 
    T mySum = (i < N) ? g_idata[i] : 0;
    if (i + blockDim.x < N) mySum += g_idata[i+blockDim.x];
    sdata[tid] = mySum;

    // --- Before going further, we have to make sure that all the shared memory loads have been completed
    __syncthreads();

    // --- Reduction in shared memory. Only half of the threads contribute to reduction.
    for (unsigned int s=blockDim.x/2; s>32; s>>=1)
    {
        if (tid < s) { sdata[tid] = mySum = mySum + sdata[tid + s]; }
        // --- At the end of each iteration loop, we have to make sure that all memory operations have been completed
        __syncthreads();
    }

    // --- Single warp reduction by loop unrolling. Assuming blockDim.x >64
    if (tid < 32) warpReduce(sdata, tid);

    // --- Write result for this block to global memory. At the end of the kernel, global memory will contain the results for the summations of
    //     individual blocks
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}
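
For completeness, here is a minimal sketch of how such a kernel might be launched; the sizes, block dimension, and pointer names below are illustrative assumptions, not taken from the original post.

// Hypothetical launch of the reduction kernel above, instantiated for float.
// Each block reduces 2 * BLOCK_SIZE input elements, so the grid is sized accordingly.
const unsigned int N          = 1 << 20;    // number of input elements (illustrative)
const unsigned int BLOCK_SIZE = 256;        // power of two, at least 64, as the kernel assumes
const unsigned int GRID_SIZE  = (N + 2 * BLOCK_SIZE - 1) / (2 * BLOCK_SIZE);

float *d_idata, *d_odata;
cudaMalloc(&d_idata, N * sizeof(float));
cudaMalloc(&d_odata, GRID_SIZE * sizeof(float));
// ... copy the input data to d_idata ...

// The third launch parameter is the dynamic shared memory size (one element per thread).
reduce4_no_synchthreads<float><<<GRID_SIZE, BLOCK_SIZE, BLOCK_SIZE * sizeof(float)>>>(d_idata, d_odata, N);

// d_odata now holds one partial sum per block; these GRID_SIZE values still have to be
// combined (by a second kernel launch or on the host) to obtain the final result.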

As pointed out by Mark Harris and talonmies, the shared memory variable sdata must be declared volatile to prevent the compiler from caching intermediate results in registers instead of writing them back to shared memory. So, the right way to define the __device__ function above is:

template <class T>
__device__ void warpReduce(volatile T *sdata, int tid) {
    sdata[tid] += sdata[tid + 32];
    sdata[tid] += sdata[tid + 16];
    sdata[tid] += sdata[tid + 8];
    sdata[tid] += sdata[tid + 4];
    sdata[tid] += sdata[tid + 2];
    sdata[tid] += sdata[tid + 1];
}

Let us now look at the disassembled code for the two cases examined above, i.e., sdata declared without and with volatile (code compiled for the Fermi architecture).

Not volatile

    /*0000*/         MOV R1, c[0x1][0x100];                          /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_CTAID.X;                             /* 0x2c00000094001c04 */
    /*0010*/         SHL R3, R0, 0x1;                                /* 0x6000c0000400dc03 */
    /*0018*/         S2R R2, SR_TID.X;                               /* 0x2c00000084009c04 */
    /*0020*/         IMAD R3, R3, c[0x0][0x8], R2;                   /* 0x200440002030dca3 */
    /*0028*/         IADD R4, R3, c[0x0][0x8];                       /* 0x4800400020311c03 */
    /*0030*/         ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT;  /* 0x188e4000a031dc03 */
    /*0038*/         ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT;  /* 0x1b0e4000a043dc03 */
    /*0040*/     @P0 ISCADD R3, R3, c[0x0][0x20], 0x2;               /* 0x400040008030c043 */
    /*0048*/    @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2;               /* 0x4000400080412443 */
    /*0050*/    @!P0 MOV R5, RZ;                                     /* 0x28000000fc0161e4 */
    /*0058*/    @!P1 LD R4, [R4];                                    /* 0x8000000000412485 */
    /*0060*/     @P0 LD R5, [R3];                                    /* 0x8000000000314085 */
    /*0068*/         SHL R3, R2, 0x2;                                /* 0x6000c0000820dc03 */
    /*0070*/         NOP;                                            /* 0x4000000000001de4 */
    /*0078*/    @!P1 IADD R5, R4, R5;                                /* 0x4800000014416403 */
    /*0080*/         MOV R4, c[0x0][0x8];                            /* 0x2800400020011de4 */
    /*0088*/         STS [R3], R5;                                   /* 0xc900000000315c85 */
    /*0090*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0098*/         MOV R6, c[0x0][0x8];                            /* 0x2800400020019de4 */
    /*00a0*/         ISETP.LT.U32.AND P0, PT, R6, 0x42, PT;          /* 0x188ec0010861dc03 */
    /*00a8*/     @P0 BRA 0x118;                                      /* 0x40000001a00001e7 */
    /*00b0*/         NOP;                                            /* 0x4000000000001de4 */
    /*00b8*/         NOP;                                            /* 0x4000000000001de4 */
    /*00c0*/         MOV R6, R4;                                     /* 0x2800000010019de4 */
    /*00c8*/         SHR.U32 R4, R4, 0x1;                            /* 0x5800c00004411c03 */
    /*00d0*/         ISETP.GE.U32.AND P0, PT, R2, R4, PT;            /* 0x1b0e00001021dc03 */
    /*00d8*/    @!P0 IADD R7, R4, R2;                                /* 0x480000000841e003 */
    /*00e0*/    @!P0 SHL R7, R7, 0x2;                                /* 0x6000c0000871e003 */
    /*00e8*/    @!P0 LDS R7, [R7];                                   /* 0xc10000000071e085 */
    /*00f0*/    @!P0 IADD R5, R7, R5;                                /* 0x4800000014716003 */
    /*00f8*/    @!P0 STS [R3], R5;                                   /* 0xc900000000316085 */
    /*0100*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0108*/         ISETP.GT.U32.AND P0, PT, R6, 0x83, PT;          /* 0x1a0ec0020c61dc03 */
    /*0110*/     @P0 BRA 0xc0;                                       /* 0x4003fffea00001e7 */
    /*0118*/         ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT;          /* 0x1a0ec0007c21dc03 */
    /*0120*/     @P0 BRA.U 0x198;                                    /* 0x40000001c00081e7 */
    /*0128*/    @!P0 LDS R8, [R3];                                   /* 0xc100000000322085 */
    /*0130*/    @!P0 LDS R5, [R3+0x80];                              /* 0xc100000200316085 */
    /*0138*/    @!P0 LDS R4, [R3+0x40];                              /* 0xc100000100312085 */
    /*0140*/    @!P0 LDS R7, [R3+0x20];                              /* 0xc10000008031e085 */
    /*0148*/    @!P0 LDS R6, [R3+0x10];                              /* 0xc10000004031a085 */
    /*0150*/    @!P0 IADD R8, R8, R5;                                /* 0x4800000014822003 */
    /*0158*/    @!P0 IADD R8, R8, R4;                                /* 0x4800000010822003 */
    /*0160*/    @!P0 LDS R5, [R3+0x8];                               /* 0xc100000020316085 */
    /*0168*/    @!P0 IADD R7, R8, R7;                                /* 0x480000001c81e003 */
    /*0170*/    @!P0 LDS R4, [R3+0x4];                               /* 0xc100000010312085 */
    /*0178*/    @!P0 IADD R6, R7, R6;                                /* 0x480000001871a003 */
    /*0180*/    @!P0 IADD R5, R6, R5;                                /* 0x4800000014616003 */
    /*0188*/    @!P0 IADD R4, R5, R4;                                /* 0x4800000010512003 */
    /*0190*/    @!P0 STS [R3], R4;                                   /* 0xc900000000312085 */
    /*0198*/         ISETP.NE.AND P0, PT, R2, RZ, PT;                /* 0x1a8e0000fc21dc23 */
    /*01a0*/     @P0 BRA.U 0x1c0;                                    /* 0x40000000600081e7 */
    /*01a8*/    @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2;               /* 0x4000400090002043 */
    /*01b0*/    @!P0 LDS R2, [RZ];                                   /* 0xc100000003f0a085 */
    /*01b8*/    @!P0 ST [R0], R2;                                    /* 0x900000000000a085 */
    /*01c0*/         EXIT;                                           /* 0x8000000000001de7 */

Lines /*0128*/-/*0148*/, /*0160*/, and /*0170*/ correspond to the shared memory loads into registers, and line /*0190*/ to the single shared memory store from a register. The intermediate lines are the summations, performed entirely in registers. In other words, the intermediate results are kept in registers (which are private to each thread) and are never flushed back to shared memory, so the other threads of the warp never see them.
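
In source terms, the optimisation visible in this listing amounts to something like the following; this is an illustrative sketch of what the compiler effectively generates (the function name is made up), not code from the original answer.

// What the non-volatile warpReduce() effectively compiles to on Fermi: all the
// partial sums are accumulated in a per-thread register, and only the final value
// is written back to shared memory with a single STS at the end.
template <class T>
__device__ void warpReduce_as_compiled(T *sdata, int tid) {
    T r = sdata[tid];          // one load of the thread's own element
    r += sdata[tid + 32];      // the neighbours' elements are loaded...
    r += sdata[tid + 16];
    r += sdata[tid + 8];
    r += sdata[tid + 4];
    r += sdata[tid + 2];
    r += sdata[tid + 1];
    sdata[tid] = r;            // ...but the running sum is stored only once, here
}

Because the updated value of sdata[tid] lives in a register until that final store, the neighbouring threads never observe the intermediate sums, and the warp-synchronous exchange of partial results that the unrolled code relies on is broken.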

volatile

    /*0000*/         MOV R1, c[0x1][0x100];                          /* 0x2800440400005de4 */
    /*0008*/         S2R R0, SR_CTAID.X;                             /* 0x2c00000094001c04 */
    /*0010*/         SHL R3, R0, 0x1;                                /* 0x6000c0000400dc03 */
    /*0018*/         S2R R2, SR_TID.X;                               /* 0x2c00000084009c04 */
    /*0020*/         IMAD R3, R3, c[0x0][0x8], R2;                   /* 0x200440002030dca3 */
    /*0028*/         IADD R4, R3, c[0x0][0x8];                       /* 0x4800400020311c03 */
    /*0030*/         ISETP.LT.U32.AND P0, PT, R3, c[0x0][0x28], PT;  /* 0x188e4000a031dc03 */
    /*0038*/         ISETP.GE.U32.AND P1, PT, R4, c[0x0][0x28], PT;  /* 0x1b0e4000a043dc03 */
    /*0040*/     @P0 ISCADD R3, R3, c[0x0][0x20], 0x2;               /* 0x400040008030c043 */
    /*0048*/    @!P1 ISCADD R4, R4, c[0x0][0x20], 0x2;               /* 0x4000400080412443 */
    /*0050*/    @!P0 MOV R5, RZ;                                     /* 0x28000000fc0161e4 */
    /*0058*/    @!P1 LD R4, [R4];                                    /* 0x8000000000412485 */
    /*0060*/     @P0 LD R5, [R3];                                    /* 0x8000000000314085 */
    /*0068*/         SHL R3, R2, 0x2;                                /* 0x6000c0000820dc03 */
    /*0070*/         NOP;                                            /* 0x4000000000001de4 */
    /*0078*/    @!P1 IADD R5, R4, R5;                                /* 0x4800000014416403 */
    /*0080*/         MOV R4, c[0x0][0x8];                            /* 0x2800400020011de4 */
    /*0088*/         STS [R3], R5;                                   /* 0xc900000000315c85 */
    /*0090*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0098*/         MOV R6, c[0x0][0x8];                            /* 0x2800400020019de4 */
    /*00a0*/         ISETP.LT.U32.AND P0, PT, R6, 0x42, PT;          /* 0x188ec0010861dc03 */
    /*00a8*/     @P0 BRA 0x118;                                      /* 0x40000001a00001e7 */
    /*00b0*/         NOP;                                            /* 0x4000000000001de4 */
    /*00b8*/         NOP;                                            /* 0x4000000000001de4 */
    /*00c0*/         MOV R6, R4;                                     /* 0x2800000010019de4 */
    /*00c8*/         SHR.U32 R4, R4, 0x1;                            /* 0x5800c00004411c03 */
    /*00d0*/         ISETP.GE.U32.AND P0, PT, R2, R4, PT;            /* 0x1b0e00001021dc03 */
    /*00d8*/    @!P0 IADD R7, R4, R2;                                /* 0x480000000841e003 */
    /*00e0*/    @!P0 SHL R7, R7, 0x2;                                /* 0x6000c0000871e003 */
    /*00e8*/    @!P0 LDS R7, [R7];                                   /* 0xc10000000071e085 */
    /*00f0*/    @!P0 IADD R5, R7, R5;                                /* 0x4800000014716003 */
    /*00f8*/    @!P0 STS [R3], R5;                                   /* 0xc900000000316085 */
    /*0100*/         BAR.RED.POPC RZ, RZ, RZ, PT;                    /* 0x50ee0000ffffdc04 */
    /*0108*/         ISETP.GT.U32.AND P0, PT, R6, 0x83, PT;          /* 0x1a0ec0020c61dc03 */
    /*0110*/     @P0 BRA 0xc0;                                       /* 0x4003fffea00001e7 */
    /*0118*/         ISETP.GT.U32.AND P0, PT, R2, 0x1f, PT;          /* 0x1a0ec0007c21dc03 */
    /*0120*/         SSY 0x1f0;                                      /* 0x6000000320000007 */
    /*0128*/     @P0 NOP.S;                                          /* 0x40000000000001f4 */
    /*0130*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0138*/         LDS R4, [R3+0x80];                              /* 0xc100000200311c85 */
    /*0140*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0148*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0150*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0158*/         LDS R4, [R3+0x40];                              /* 0xc100000100311c85 */
    /*0160*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0168*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0170*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0178*/         LDS R4, [R3+0x20];                              /* 0xc100000080311c85 */
    /*0180*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*0188*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*0190*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*0198*/         LDS R4, [R3+0x10];                              /* 0xc100000040311c85 */
    /*01a0*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*01a8*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*01b0*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*01b8*/         LDS R4, [R3+0x8];                               /* 0xc100000020311c85 */
    /*01c0*/         IADD R6, R5, R4;                                /* 0x4800000010519c03 */
    /*01c8*/         STS [R3], R6;                                   /* 0xc900000000319c85 */
    /*01d0*/         LDS R5, [R3];                                   /* 0xc100000000315c85 */
    /*01d8*/         LDS R4, [R3+0x4];                               /* 0xc100000010311c85 */
    /*01e0*/         IADD R4, R5, R4;                                /* 0x4800000010511c03 */
    /*01e8*/         STS.S [R3], R4;                                 /* 0xc900000000311c95 */
    /*01f0*/         ISETP.NE.AND P0, PT, R2, RZ, PT;                /* 0x1a8e0000fc21dc23 */
    /*01f8*/     @P0 BRA.U 0x218;                                    /* 0x40000000600081e7 */
    /*0200*/    @!P0 ISCADD R0, R0, c[0x0][0x24], 0x2;               /* 0x4000400090002043 */
    /*0208*/    @!P0 LDS R2, [RZ];                                   /* 0xc100000003f0a085 */
    /*0210*/    @!P0 ST [R0], R2;                                    /* 0x900000000000a085 */
    /*0218*/         EXIT;                                           /* 0x8000000000001de7 */

As can be seen from lines /*0130*/-/*01e8*/, each time a summation is performed the intermediate result is now immediately stored back to shared memory, so it is visible to all the other threads of the warp.
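
Correspondingly, the volatile qualifier forces the compiler to treat every line of warpReduce() as an explicit read-modify-write of shared memory. In source terms it behaves roughly like the sketch below (again illustrative, with a made-up function name, not actual compiler output).

// With volatile, each line of warpReduce() turns into a load / load / add / store
// sequence (the LDS / LDS / IADD / STS groups in the listing above): the partial sum
// is written back to shared memory before the next line reads the neighbour's
// freshly stored partial sum.
template <class T>
__device__ void warpReduce_volatile_effect(volatile T *sdata, int tid) {
    T a, b;
    a = sdata[tid]; b = sdata[tid + 32]; sdata[tid] = a + b;
    a = sdata[tid]; b = sdata[tid + 16]; sdata[tid] = a + b;
    a = sdata[tid]; b = sdata[tid + 8];  sdata[tid] = a + b;
    a = sdata[tid]; b = sdata[tid + 4];  sdata[tid] = a + b;
    a = sdata[tid]; b = sdata[tid + 2];  sdata[tid] = a + b;
    a = sdata[tid]; b = sdata[tid + 1];  sdata[tid] = a + b;
}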

Vitality

Maybe have a look at these slides from Mark Harris. Why reinvent the wheel?

www.uni-graz.at/~haasegu/Lectures/GPU_CUDA/Lit/reduction.pdf?page=35

Each reduction step depends on the previous one, so you can only leave out the synchronization once the last warp (32 active threads) is executing the reduction. One step earlier, 64 threads are still active, and you do need a synchronization there, because those threads belong to 2 warps whose parallel execution is not guaranteed.
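
A compact sketch of this pattern is given below; the function and variable names are made up, and it assumes a power-of-two block size of at least 64 and a volatile shared array, as discussed in the answer above.

// Block-wide sum of blockDim.x values already stored in s_data.
// __syncthreads() is required while more than one warp participates in a step;
// once only 32 threads (a single warp) are left, the remaining steps are unrolled
// without synchronization, which is why s_data must be volatile.
__device__ void block_sum(volatile float *s_data, unsigned int tid)
{
    for (unsigned int s = blockDim.x / 2; s > 32; s >>= 1) {
        if (tid < s) s_data[tid] += s_data[tid + s];
        __syncthreads();                 // two or more warps are still active here
    }
    if (tid < 32) {                      // a single warp is left: no __syncthreads() needed
        s_data[tid] += s_data[tid + 32];
        s_data[tid] += s_data[tid + 16];
        s_data[tid] += s_data[tid + 8];
        s_data[tid] += s_data[tid + 4];
        s_data[tid] += s_data[tid + 2];
        s_data[tid] += s_data[tid + 1];
    }
    // s_data[0] now holds the block sum; threads outside the first warp need a
    // __syncthreads() before reading it.
}

For the 128 numbers mentioned in the comments below, the loop body runs only for s = 64, so a single __syncthreads() is needed inside the reduction itself.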

djmj
  • That's pretty much what I want to do. The problem is that when I leave __syncthreads() out, things start to break. And the code actually works in debug mode while it breaks in release mode. – small_potato May 24 '12 at 02:42
  • Is your intention to implement warp-based reduction? Reduce inside each warp to cut the data down by a factor of 32, so that with 1024 threads/elements only 2 __syncthreads() calls are necessary? This could maybe improve performance a lot compared to the conventional implementation. I will check this idea out later. – djmj May 24 '12 at 03:15
  • The problem I am facing is just summing 128 numbers residing in shared memory. I am not facing a global reduction problem, but what you are saying might work as well. – small_potato May 24 '12 at 03:44
  • Then use the code on page 35 of the PDF, which needs only one __syncthreads(). – djmj May 24 '12 at 21:01