CUDA - Parallel Reduction Sum

Question

I am trying to implement a parallel reduction sum in CUDA 7.5. I have been trying to follow the NVIDIA PDF that walks you through the initial algorithm and then steadily more optimised versions. I am currently making an array that is filled with 1 as the value in every array position so that I can check the output is correct but I am getting a value of -842159451 for an array of size 64. I am expecting that the kernel code is correct as I have followed the exact code from NVIDIA for it but here is my kernel:

__global__ void reduce0(int *input, int *output) {
    extern __shared__ int sdata[];

    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

    sdata[tid] = input[i];

    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) {
            sdata[tid] += sdata[tid + s];
        }

        __syncthreads();
    }

    if (tid == 0) output[blockIdx.x] = sdata[0];
}

Here is my code calling the kernel, which is where I expect my problem to be:

int main()
{
    int numThreadsPerBlock = 1024;

    int *hostInput;
    int *hostOutput; 
    int *deviceInput; 
    int *deviceOutput; 

    int numInputElements = 64;
    int numOutputElements; // number of elements in the output list, initialised below

    numOutputElements = numInputElements / (numThreadsPerBlock / 2);
    if (numInputElements % (numThreadsPerBlock / 2)) {
        numOutputElements++;
    }

    hostInput = (int *)malloc(numInputElements * sizeof(int));
    hostOutput = (int *)malloc(numOutputElements * sizeof(int));


    for (int i = 0; i < numInputElements; ++i) {
        hostInput[i] = 1;
    }

    const dim3 blockSize(numThreadsPerBlock, 1, 1);
    const dim3 gridSize(numOutputElements, 1, 1);

    cudaMalloc((void **)&deviceInput, numInputElements * sizeof(int));
    cudaMalloc((void **)&deviceOutput, numOutputElements * sizeof(int));

    cudaMemcpy(deviceInput, hostInput, numInputElements * sizeof(int), cudaMemcpyHostToDevice);

    reduce0 << <gridSize, blockSize >> >(deviceInput, deviceOutput);

    cudaMemcpy(hostOutput, deviceOutput, numOutputElements * sizeof(int), cudaMemcpyDeviceToHost);

    for (int ii = 1; ii < numOutputElements; ii++) {
        hostOutput[0] += hostOutput[ii]; //accumulates the sum in the first element
    }

    int sumGPU = hostOutput[0];

    printf("GPU Result: %d\n", sumGPU);

    std::string wait;
    std::cin >> wait;

    return 0;
}

I have also tried bigger and smaller array sizes for the input and I get the same result of a very large negative value no matter the size of the array.

score 9 · Accepted Answer · answered Nov 02 '17 at 21:24

Seems you are using a dynamically allocated shared array:

extern __shared__ int sdata[];

but you are not allocating it in the kernel invocation:

reduce0 <<<gridSize, blockSize >>>(deviceInput, deviceOutput);

You have two options:

Option 1

Allocate the shared memory statically in the kernel, e.g.

constexpr int threadsPerBlock = 1024;
__shared__ int sdata[threadsPerBlock];

More often than not I find this the cleanest approach, as it works without a problem when you have multiple arrays in shared memory. The drawback is that while the size usually depends on the number of threads in the block, you need the size to be known at compile-time.

Option 2

Specify the amount of dynamically allocated shared memory in the kernel invocation.

reduce0 <<<gridSize, blockSize, numThreadsPerBlock*sizeof(int) >>>(deviceInput, deviceOutput);

This will work for any value of numThreadsPerBlock (provided it is within the allowed range of course). The drawback is that if you have multiple extern shared arrays, you need to figure out how to put then in the memory yourself, so that one does not overwrite the other.

Note, there may be other problems in your code. I didn't test it. This is something I spotted immediately upon glancing over your code.

CUDA - Parallel Reduction Sum

1 Answers1