Result of reduction#1 is wrong

Question

I implemented reduction#1 from the well-known slides by Mark Harris, but I get 0 as the result. I filled the input array with the same values shown in the slides. I compiled with CUDA 7.0 using the command nvcc reduction1.cu -o red1. Where is the mistake? Thanks.

#include <stdio.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 16

__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];
    // each thread loads one element from global to shared mem
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x*blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    // do reduction in shared mem
    for(unsigned int s=1; s < blockDim.x; s *= 2) 
    {
        if (tid % (2*s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // write result for this block to global mem
    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

int main()
{
    int inputLength=16;
    int hostInput[16]={10,1,8,-1,0,-2,3,5,-2,-3,2,7,0,11,0,2};
    int hostOutput=0;
    int *deviceInput;
    int *deviceOutput;

    cudaMalloc((void **)&deviceInput, inputLength * sizeof(int));
    cudaMalloc((void **)&deviceOutput, sizeof(int));

    cudaMemcpy(deviceInput, hostInput, inputLength * sizeof(int),cudaMemcpyHostToDevice);

    reduce1<<<1,THREADS_PER_BLOCK>>>(deviceInput, deviceOutput);

    cudaDeviceSynchronize();

    cudaMemcpy(&hostOutput, deviceOutput,sizeof(int), cudaMemcpyDeviceToHost);

    printf("%d\n",hostOutput);

    cudaFree(deviceInput);
    cudaFree(deviceOutput);

    return 0;
}

You are not specifying any size for the dynamic shared memory allocation. I specifically mentioned how this works in my last answer to you. If you bother with error checking you will find the kernel is failing with a memory access violation. — talonmies, May 22 '17 at 09:16
You should implement [proper CUDA error checking](https://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) **before** asking others for help. It may help you to understand the problem yourself, and even if you don't understand it, the error output will be useful for those trying to help you. — Robert Crovella, May 22 '17 at 14:18

score 1 · Answer 1 · answered May 22 '17 at 13:12

As talonmies said, you are using dynamic shared memory, but you are not allocating any space for it. You have to specify its size in bytes as the third argument of the kernel execution configuration. Here 16 threads each store one int, so the block needs 16 * sizeof(int) = 64 bytes:

reduce1<<<1, THREADS_PER_BLOCK, 64>>>(deviceInput, deviceOutput);
                                ^^
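
To avoid hard-coding 64, the size can be computed from the block size, and the launch can be checked for errors as the comments suggest. The following is only a minimal sketch of the program from the question under those assumptions: CHECK_CUDA and sumOneBlock are hypothetical names introduced here for illustration, not part of the CUDA runtime API, and the wrapper assumes the input holds exactly THREADS_PER_BLOCK values.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 16

// CHECK_CUDA is a hypothetical helper name, not part of the CUDA runtime API.
// It aborts with a readable message when a runtime call reports an error.
#define CHECK_CUDA(call)                                               \
    do {                                                               \
        cudaError_t err_ = (call);                                     \
        if (err_ != cudaSuccess) {                                     \
            fprintf(stderr, "CUDA error at %s:%d: %s\n", __FILE__,     \
                    __LINE__, cudaGetErrorString(err_));               \
            exit(EXIT_FAILURE);                                        \
        }                                                              \
    } while (0)

// Same kernel as in the question: interleaved addressing within one block.
__global__ void reduce1(int *g_idata, int *g_odata) {
    extern __shared__ int sdata[];  // sized by the third launch argument
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical wrapper: sums exactly THREADS_PER_BLOCK ints using one block.
int sumOneBlock(const int *hostInput) {
    int hostOutput = 0;
    int *deviceInput = NULL;
    int *deviceOutput = NULL;
    size_t inputBytes = THREADS_PER_BLOCK * sizeof(int);

    CHECK_CUDA(cudaMalloc((void **)&deviceInput, inputBytes));
    CHECK_CUDA(cudaMalloc((void **)&deviceOutput, sizeof(int)));
    CHECK_CUDA(cudaMemcpy(deviceInput, hostInput, inputBytes,
                          cudaMemcpyHostToDevice));

    // One int of dynamic shared memory per thread in the block.
    size_t sharedBytes = THREADS_PER_BLOCK * sizeof(int);
    reduce1<<<1, THREADS_PER_BLOCK, sharedBytes>>>(deviceInput, deviceOutput);
    CHECK_CUDA(cudaGetLastError());       // errors from the launch itself
    CHECK_CUDA(cudaDeviceSynchronize());  // errors raised while the kernel ran

    CHECK_CUDA(cudaMemcpy(&hostOutput, deviceOutput, sizeof(int),
                          cudaMemcpyDeviceToHost));
    CHECK_CUDA(cudaFree(deviceInput));
    CHECK_CUDA(cudaFree(deviceOutput));
    return hostOutput;
}

int main() {
    int hostInput[THREADS_PER_BLOCK] = {10, 1, 8, -1, 0, -2, 3, 5,
                                        -2, -3, 2, 7, 0, 11, 0, 2};
    printf("%d\n", sumOneBlock(hostInput));  // the 16 inputs sum to 41
    return 0;
}
```

With the checks in place, a launch without the third argument should stop with an error message instead of silently printing 0.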

Another way to fix this code is to use static shared memory. Declare your shared memory like this:

__shared__ int sdata[16];
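
For reference, here is a rough sketch of how the static variant could look as a complete program. The names reduce1Static and sumOneBlockStatic are made up for this example, and the array is sized with THREADS_PER_BLOCK instead of a literal 16 so that it stays in step with the launch configuration.

```cuda
#include <stdio.h>
#include <stdlib.h>
#include <cuda_runtime.h>

#define THREADS_PER_BLOCK 16

// Hypothetical static variant of the kernel from the question.
__global__ void reduce1Static(const int *g_idata, int *g_odata) {
    // Size is fixed at compile time, so no third launch argument is needed.
    __shared__ int sdata[THREADS_PER_BLOCK];
    unsigned int tid = threadIdx.x;
    unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;
    sdata[tid] = g_idata[i];
    __syncthreads();

    for (unsigned int s = 1; s < blockDim.x; s *= 2) {
        if (tid % (2 * s) == 0) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    if (tid == 0) g_odata[blockIdx.x] = sdata[0];
}

// Hypothetical wrapper: sums exactly THREADS_PER_BLOCK ints using one block.
int sumOneBlockStatic(const int *hostInput) {
    int hostOutput = 0;
    int *deviceInput = NULL;
    int *deviceOutput = NULL;
    size_t inputBytes = THREADS_PER_BLOCK * sizeof(int);

    cudaMalloc((void **)&deviceInput, inputBytes);
    cudaMalloc((void **)&deviceOutput, sizeof(int));
    cudaMemcpy(deviceInput, hostInput, inputBytes, cudaMemcpyHostToDevice);

    reduce1Static<<<1, THREADS_PER_BLOCK>>>(deviceInput, deviceOutput);

    // Report a failed launch or a failed kernel run instead of a bogus sum.
    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess) err = cudaDeviceSynchronize();
    if (err != cudaSuccess) {
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));
        exit(EXIT_FAILURE);
    }

    cudaMemcpy(&hostOutput, deviceOutput, sizeof(int), cudaMemcpyDeviceToHost);
    cudaFree(deviceInput);
    cudaFree(deviceOutput);
    return hostOutput;
}

int main() {
    int hostInput[THREADS_PER_BLOCK] = {10, 1, 8, -1, 0, -2, 3, 5,
                                        -2, -3, 2, 7, 0, 11, 0, 2};
    printf("%d\n", sumOneBlockStatic(hostInput));  // the 16 inputs sum to 41
    return 0;
}
```

The trade-off is that the block size is now tied to the array size: launching this kernel with more than THREADS_PER_BLOCK threads per block would write past the end of sdata.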

Please read this before asking questions about CUDA.
