Parallel reduction #5 unroll the last warp

Question

I'm studying the reduction from the well-known slides by Mark Harris. In particular, I implemented the optimization step #5 but I obtain a wrong result: 17 instead of 41. I used the same number sequence shown in the slides. I omitted in the code the "extern" for the shared array, because the kernel and the host code are in the same .cu file.

#include <stdio.h>
#include <cuda_runtime.h>

#define THREAD_PER_BLOCK 16

__global__ void reduce5(int *g_idata, int *g_odata) {
    __shared__ int sdata[THREAD_PER_BLOCK];
    // perform first level of reduction,
    // reading from global memory, writing to shared memory
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*(blockDim.x*2) + threadIdx.x;
    sdata[tid] = g_idata[i] + g_idata[i+blockDim.x];
    __syncthreads();
    // do reduction in shared mem
    for (unsigned int s=blockDim.x/2; s>32; s>>=1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid < 32)
    {
        sdata[tid] += sdata[tid + 32];
        sdata[tid] += sdata[tid + 16];
        sdata[tid] += sdata[tid + 8];
        sdata[tid] += sdata[tid + 4];
        sdata[tid] += sdata[tid + 2];
        sdata[tid] += sdata[tid + 1];
    }
    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int inputLength=16;
    int hostInput[16]={10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};
    int hostOutput=0;
    int *deviceInput;
    int *deviceOutput;

    cudaMalloc((void **)&deviceInput, inputLength * sizeof(int));
    cudaMalloc((void **)&deviceOutput, sizeof(int));

    cudaMemcpy(deviceInput, hostInput, inputLength * sizeof(int),cudaMemcpyHostToDevice);

    reduce5<<<1,16>>>(deviceInput, deviceOutput);

    cudaDeviceSynchronize();

    cudaMemcpy(&hostOutput, deviceOutput,sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n",hostOutput);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);

    return 0;
}

talonmies · Answer 1 · 2017-05-23T09:46:01.710

In this code THREAD_PER_BLOCK must be a multiple of 32, and at least 64. The length of the input data must also be twice the product of block and grid sizes.

You don't see it (because you aren't performing any sort of error checking), but the thread and warp reductions will be failing because of out-of-bounds shared memory and global memeory access.

Also note that extern __shared__ has nothing to do with whether the kernel and other code are in the same file. That denotes that shared memory for that variable will be dynamically allocated at runtime, rather than statically at compile time. The size of the allocation is passed as the third argument in the kernel launch syntax.

I performed the following changes THREAD_PER_BLOCK 64 and reduce5<<<1,THREAD_PER_BLOCK>>>(deviceInput, deviceOutput); but I obtain always 17. — horus, May 22 '17 at 07:56

score 1 · Answer 2 · answered Jul 21 '17 at 03:32

1

I encountered the same problem and found it turns out to be the threads are not actually in synchronization if the variable is not declared as volatile.

Simply add volatile when declaring sdata the problem will be solved. Please refer to my post: cuda Threads in A Warp appear to be not in synchronization

answered Jul 21 '17 at 03:32

Leonard Hung

21
3

Thank you very much :) It fixed a similar problem in my code :) – Baptiste Wicht Aug 30 '17 at 11:29

Parallel reduction #5 unroll the last warp

2 Answers2