A simple reduction program in CUDA

Question

In the code below, I am trying to implement a simple parallel reduction using 1024 blocks of 1024 threads each. After implementing the partial reduction, I want to check whether the implementation is correct, so I have the program print the first element of the host array after the data has been copied back from device memory. The host array is initialized to 1 and copied to device memory for the reduction. However, the printf statement after the reduction still prints 1 for the first element of the array.

Is the problem in what I am printing, or is there a logical error in the reduction itself? In addition, the printf statements in the kernel do not print anything. Is there something wrong with my syntax or with the call to printf? My code is below:

```
ifndef CUDACC
define CUDACC
endif
include "cuda_runtime.h"
include "device_launch_parameters.h"
include
include
ifndef THREADSPERBLOCK
define THREADSPERBLOCK 1024
endif
ifndef NUMBLOCKS
define NUMBLOCKS 1024
endif

global void reduceKernel(int *c)
{
extern shared int sh_arr[];

int index = blockDim.x*blockIdx.x + threadIdx.x;
int sh_index = threadIdx.x;

// Storing data from Global memory to shared Memory
sh_arr[sh_index] = c[index];
__syncthreads();

for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
{
    if(sh_index < i){
        sh_arr[sh_index] += sh_arr[i+sh_index];
    }
    __syncthreads();
}

if(sh_index ==0)
    c[blockIdx.x]=sh_arr[sh_index];
printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
return;

}

int main()
{
int *h_a;
int *d_a;
int share_memSize, h_memSize;
size_t d_memSize;

share_memSize = THREADSPERBLOCK*sizeof(int);
h_memSize = THREADSPERBLOCK*NUMBLOCKS;

h_a = (int*)malloc(sizeof(int)*h_memSize);

d_memSize=THREADSPERBLOCK*NUMBLOCKS;
cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));

for(int i=0; i<h_memSize; i++)
{
    h_a[i]=1;    
};

//printf("last element of array %d \n", h_a[h_memSize-1]);

cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
cudaMemcpy((void**)&h_a, (void**)&d_a, d_memSize, cudaMemcpyDeviceToHost);

printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
printf("sum after reduction %d \n", h_a[0]);

}
```

score 1 · Answer 1 · edited May 23 '17 at 12:22

There are a number of problems with this code.

1. Much of what you've posted is not valid code. As just a few examples, the global and shared keywords need double underscores before and after, like this: __global__ and __shared__. I assume this is some sort of copy-paste or formatting error. There are problems with your define statements as well. You should endeavor to post code that doesn't have these sorts of problems.
2. Any time you are having trouble with a CUDA code, you should use proper CUDA error checking and run your code with cuda-memcheck before asking for help. If you had done this, it would have focused your attention on item 3 below.
3. Your cudaMemcpy operations are broken in a couple of ways:
   ```
   cudaMemcpy((void**)&d_a, (void**)&h_a, h_memSize, cudaMemcpyHostToDevice);
   ```
   First, unlike cudaMalloc, but like memcpy, cudaMemcpy just takes ordinary pointer arguments. Second, the size of the transfer (as with memcpy) is in bytes, so your sizes need to be scaled up by sizeof(int):
   ```
   cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
   ```
   and similarly for the one after the kernel.
4. Calling printf from every thread in a large kernel (like this one, which has 1048576 threads) is probably not a good idea. You won't actually get all the output you expect, and on Windows (it appears you are running on Windows) you may run into a WDDM watchdog timeout due to kernel execution taking too long. If you need to printf from a large kernel, be selective and condition your printf on threadIdx.x and blockIdx.x.
5. The above things are probably enough to get some sensible printout, and as you point out you're not finished yet anyway: "I wish to see whether my implementation is going right or not". However, this kernel, as crafted, overwrites its input data with output data:
   ```
   __global__ void reduceKernel(int *c)
   ...
       c[blockIdx.x]=sh_arr[sh_index];
   ```
   This will lead to a race condition: blocks 1 through 1023 write their results to c[1] through c[1023], which are also part of block 0's input, and nothing guarantees that block 0 has read those locations before they are overwritten. Rather than trying to sort this out for you, I'd suggest separating your output data from your input data. Even better, you should study the CUDA reduction sample code, which also has an associated presentation.
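As an illustration of the error checking and selective printf advice above, here is a minimal sketch. The `CUDA_CHECK` macro, `incrementKernel` and `runIncrement` are hypothetical names made up for this example; they are not part of the CUDA API or of the code in the question. The sketch checks the return value of every runtime call, checks the kernel launch with `cudaGetLastError`, and prints from a single thread only.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Hypothetical helper macro (not part of CUDA): report and abort if a
// runtime API call returns anything other than cudaSuccess.
#define CUDA_CHECK(call)                                              \
    do {                                                              \
        cudaError_t err_ = (call);                                    \
        if (err_ != cudaSuccess) {                                    \
            fprintf(stderr, "CUDA error \"%s\" at %s:%d\n",           \
                    cudaGetErrorString(err_), __FILE__, __LINE__);    \
            exit(EXIT_FAILURE);                                       \
        }                                                             \
    } while (0)

// Illustrative kernel: every thread increments its own element, but only
// thread 0 of block 0 prints, so the output stays small.
__global__ void incrementKernel(int *data)
{
    int index = blockDim.x * blockIdx.x + threadIdx.x;
    data[index] += 1;

    if (blockIdx.x == 0 && threadIdx.x == 0)
        printf("block %d, thread %d: data[%d] = %d\n",
               blockIdx.x, threadIdx.x, index, data[index]);
}

// Hypothetical host wrapper: zero-fills a device array, runs the kernel,
// and returns the first element after the kernel has finished.
int runIncrement(int numBlocks, int threadsPerBlock)
{
    size_t bytes = (size_t)numBlocks * threadsPerBlock * sizeof(int);
    int *d_data = NULL;
    int first = -1;

    CUDA_CHECK(cudaMalloc((void **)&d_data, bytes));
    CUDA_CHECK(cudaMemset(d_data, 0, bytes));

    incrementKernel<<<numBlocks, threadsPerBlock>>>(d_data);
    CUDA_CHECK(cudaGetLastError());       // catches launch configuration errors
    CUDA_CHECK(cudaDeviceSynchronize());  // catches errors during execution

    // Ordinary pointers, size in bytes.
    CUDA_CHECK(cudaMemcpy(&first, d_data, sizeof(int), cudaMemcpyDeviceToHost));
    CUDA_CHECK(cudaFree(d_data));
    return first;
}

int main()
{
    printf("first element after kernel: %d\n", runIncrement(1024, 1024));
    return 0;
}
```

With checks like these in place, a failing runtime call stops the program with a message at the first error instead of letting it continue with stale data.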

Here is a modified version of your code which has most of the above issues fixed. It's still not correct. It still has defect 5 above in it. Rather than completely rewrite your code to fix defect 5, I would direct you to the cuda sample code mentioned above.

```
$ cat t820.cu
#include <stdio.h>

#ifndef THREADSPERBLOCK
#define THREADSPERBLOCK 1024
#endif
#ifndef NUMBLOCKS
#define NUMBLOCKS 1024
#endif

__global__ void reduceKernel(int *c)
{
    extern __shared__ int sh_arr[];

    int index = blockDim.x*blockIdx.x + threadIdx.x;
    int sh_index = threadIdx.x;

    // Storing data from Global memory to shared Memory
    sh_arr[sh_index] = c[index];
    __syncthreads();

    for(unsigned int i = blockDim.x/2; i>0 ; i>>=1)
    {
        if(sh_index < i){
            sh_arr[sh_index] += sh_arr[i+sh_index];
        }
        __syncthreads();
    }

    if(sh_index ==0)
        c[blockIdx.x]=sh_arr[sh_index];
    // printf("value stored at %d is %d \n", blockIdx.x, c[blockIdx.x]);
    return;

}

int main()
{
    int *h_a;
    int *d_a;
    int share_memSize, h_memSize;
    size_t d_memSize;

    share_memSize = THREADSPERBLOCK*sizeof(int);
    h_memSize = THREADSPERBLOCK*NUMBLOCKS;

    h_a = (int*)malloc(sizeof(int)*h_memSize);

    d_memSize=THREADSPERBLOCK*NUMBLOCKS;
    cudaMalloc( (void**)&d_a, h_memSize*sizeof(int));

    for(int i=0; i<h_memSize; i++)
    {
        h_a[i]=1;
    };

    //printf("last element of array %d \n", h_a[h_memSize-1]);

    cudaMemcpy(d_a, h_a, h_memSize*sizeof(int), cudaMemcpyHostToDevice);
    reduceKernel<<<NUMBLOCKS, THREADSPERBLOCK, share_memSize>>>(d_a);
    cudaMemcpy(h_a, d_a, d_memSize*sizeof(int), cudaMemcpyDeviceToHost);

    printf("sizeof host memory %d \n", d_memSize); //sizeof(h_a));
    printf("first block sum after reduction %d \n", h_a[0]);
}
$ nvcc -o t820 t820.cu
$ cuda-memcheck ./t820
========= CUDA-MEMCHECK
sizeof host memory 1048576
first block sum after reduction 1024
========= ERROR SUMMARY: 0 errors
$
```
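For readers who want to see what separating input from output might look like, here is a minimal sketch. It is not the CUDA sample code and not necessarily what the asker ended up with. The two-argument kernel signature and the `sumOnes` helper are assumptions made for this example. It also assumes that both the block count and the block size are powers of two no larger than 1024, so that a second launch with a single block can reduce the per-block partial sums.

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Same shared-memory tree reduction as above, but reading from `in` and
// writing one partial sum per block to a separate `out` buffer, so no block
// overwrites data that another block still needs to read.
__global__ void reduceKernel(const int *in, int *out)
{
    extern __shared__ int sh_arr[];

    int index = blockDim.x * blockIdx.x + threadIdx.x;
    int sh_index = threadIdx.x;

    sh_arr[sh_index] = in[index];
    __syncthreads();

    for (unsigned int i = blockDim.x / 2; i > 0; i >>= 1) {
        if (sh_index < i)
            sh_arr[sh_index] += sh_arr[sh_index + i];
        __syncthreads();
    }

    if (sh_index == 0)
        out[blockIdx.x] = sh_arr[0];
}

// Hypothetical host helper: sums numBlocks*threadsPerBlock ones on the GPU.
// Assumes both arguments are powers of two and at most 1024.
int sumOnes(int numBlocks, int threadsPerBlock)
{
    int n = numBlocks * threadsPerBlock;
    int *h_in = (int *)malloc(n * sizeof(int));
    for (int i = 0; i < n; i++)
        h_in[i] = 1;

    int *d_in = NULL, *d_partial = NULL, *d_total = NULL;
    int total = -1;
    cudaMalloc((void **)&d_in, n * sizeof(int));
    cudaMalloc((void **)&d_partial, numBlocks * sizeof(int));
    cudaMalloc((void **)&d_total, sizeof(int));
    cudaMemcpy(d_in, h_in, n * sizeof(int), cudaMemcpyHostToDevice);

    // Pass 1: one partial sum per block.
    reduceKernel<<<numBlocks, threadsPerBlock,
                   threadsPerBlock * sizeof(int)>>>(d_in, d_partial);
    // Pass 2: a single block reduces the partial sums to one value.
    reduceKernel<<<1, numBlocks, numBlocks * sizeof(int)>>>(d_partial, d_total);

    cudaError_t err = cudaGetLastError();
    if (err == cudaSuccess)
        err = cudaMemcpy(&total, d_total, sizeof(int), cudaMemcpyDeviceToHost);
    if (err != cudaSuccess)
        fprintf(stderr, "CUDA error: %s\n", cudaGetErrorString(err));

    cudaFree(d_in);
    cudaFree(d_partial);
    cudaFree(d_total);
    free(h_in);
    return total;
}

int main()
{
    // 1024 blocks of 1024 ones each should sum to 1048576.
    printf("total sum after reduction %d\n", sumOnes(1024, 1024));
    return 0;
}
```

Because the two launches are issued to the same stream, the second one starts only after the first has finished, so the partial sums are complete before they are read. The CUDA reduction sample remains the better reference for performance and for inputs that are not an exact multiple of the block size.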

For the part using __global__ and __shared__, it was a copy paste issue and was there in my local file. Thanks for the part where I made the mistake of not passing bytes in the form of size. And as for the 5th comment, I have given the function one additional argument which would store the output, and made the input const. Its working!! Thank You — codeahead, Jun 29 '15 at 17:33
hello, would it be possible to use a value reduction technic only for incrementing a variable ? — SOCKet, Aug 11 '15 at 09:31
I don't know what a value reduction technic only for incrementing a variable is. — Robert Crovella, Aug 11 '15 at 12:48
