
I want to calculate the average of the values over the whole image in CUDA. To test how reduction works on a 2D array, I wrote the kernel below. The final output `o` should be the sum of all the image values. The input `g` is a 2D array with the value 1 in every pixel. But this program produces 0 as the sum, which seems odd to me.

Imitating the 1D reduction in this tutorial http://developer.download.nvidia.com/compute/cuda/1.1-Beta/x86_website/projects/reduction/doc/reduction.pdf I wrote this 2D form. I am new to CUDA, and suggestions about potential bugs and improvements are welcome!

Just to add one comment: I know it makes sense to simply calculate the average over a 1D array, but I want to explore and test more complicated reduction behaviours. It might not be the right approach; it is just a test. I hope someone can give me more suggestions about common reduction practices.

#include <iostream>
#include <stdio.h>
#include <stdlib.h>
#include <time.h>

cudaEvent_t start, stop;
float elapsedTime;

__global__ void
reduce(float *g, float *o, const int dimx, const int dimy)
{
    extern __shared__ float sdata[];

    unsigned int tid_x = threadIdx.x;
    unsigned int tid_y = threadIdx.y;

    unsigned int i = blockDim.x * blockIdx.x + threadIdx.x;
    unsigned int j = blockDim.y * blockIdx.y + threadIdx.y;

    if (i >= dimx || j >= dimy)
        return;

    sdata[tid_x * blockDim.y + tid_y] = g[i * dimy + j];

    __syncthreads();

    for (unsigned int s_y = blockDim.y / 2; s_y > 0; s_y >>= 1)
    {
        if (tid_y < s_y)
        {
            sdata[tid_x * dimy + tid_y] += sdata[tid_x * dimy + tid_y + s_y];
        }
        __syncthreads();
    }

    for (unsigned int s_x = blockDim.x / 2; s_x > 0; s_x >>= 1)
    {
        if (tid_x < s_x)
        {
            sdata[tid_x * dimy] += sdata[(tid_x + s_x) * dimy];
        }
        __syncthreads();
    }

    float sum;

    if (tid_x == 0 && tid_y == 0)
    {
        sum = sdata[0];
        atomicAdd(o, sum);   // The result should be the sum of all pixel values, but the program produces 0
    }

    //if (tid_x == 0 && tid_y == 0)
    //    o[blockIdx.x] = sdata[0];
}

int
main()
{
    int dimx = 320;
    int dimy = 160;
    int num_bytes = dimx * dimy * sizeof(float);

    float *d_a, *h_a, // device and host pointers
          *d_o = 0, *h_o = 0;

    h_a = (float*)malloc(num_bytes);
    h_o = (float*)malloc(sizeof(float));

    srand(time(NULL));

    for (int i = 0; i < dimx; i++)
    {
        for (int j = 0; j < dimy; j++)
        {
            h_a[i * dimy + j] = 1;
        }
    }

    cudaMalloc((void**)&d_a, num_bytes);
    cudaMalloc((void**)&d_o, sizeof(int));

    cudaMemcpy(d_a, h_a, num_bytes, cudaMemcpyHostToDevice);
    cudaMemcpy(d_o, h_o, sizeof(int), cudaMemcpyHostToDevice);

    dim3 grid, block;
    block.x = 4;
    block.y = 4;
    grid.x = dimx / block.x;
    grid.y = dimy / block.y;

    cudaEventCreate(&start);
    cudaEventRecord(start, 0);

    int sizeofSharedMemory = dimx * dimy * sizeof(float);

    reduce<<<grid, block, sizeofSharedMemory>>>(d_a, d_o, block.x, block.y);

    cudaEventCreate(&stop);
    cudaEventRecord(stop, 0);
    cudaEventSynchronize(stop);

    cudaEventElapsedTime(&elapsedTime, start, stop);
    std::cout << "This kernel runs: " << elapsedTime << "ms" << std::endl;

    std::cout << block.x << " " << block.y << std::endl;
    std::cout << grid.x << " " << grid.y << std::endl;
    std::cout << dimx << " " << dimy << " " << dimx * dimy << std::endl;

    cudaMemcpy(h_a, d_a, num_bytes, cudaMemcpyDeviceToHost);
    cudaMemcpy(h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost);

    std::cout << "The sum is:" << *h_o << std::endl;

    free(h_a);
    free(h_o);
    cudaFree(d_a);
    cudaFree(d_o);
}
Lv Zhaoyang
  • A `2D` array of `MxN` elements can be regarded as a `1D` array of `MN` elements. So, why don't you exploit already developed `CUDA` reduction codes for `1D` arrays? You could have a look at the `CUDA SDK reduction` example. Concerning your code, could you post a full version, so that someone could compile and execute it and give you more insight? – Vitality Jul 20 '13 at 12:40
  • Thank you for your feedback. I have put the whole program in. Since it is a test program for me to learn the best of CUDA, I am trying to write my own. – Lv Zhaoyang Jul 20 '13 at 12:47
  • A very fast comment. You are not initializing `h_o` and so `d_o`, but this is not the only problem in this code. – Vitality Jul 20 '13 at 12:58
  • Thank you. I initialized it with `*d_o = 0, *h_o = 0` here. Isn't that correct? Actually, I tried the 1D array reduction with the same main function, and it worked. – Lv Zhaoyang Jul 20 '13 at 13:03
  • You are printing out `*h_0` *before* copying its value from the device! – talonmies Jul 20 '13 at 13:50

1 Answer


If you do basic CUDA error checking, you will discover that your reduce kernel is not even running. The reason is as follows:

int dimx = 320;
int dimy = 160;
...
int sizeofSharedMemory = dimx*dimy*sizeof(float); // = 204800

reduce<<<grid, block, sizeofSharedMemory>>> (d_a, d_o, block.x, block.y);
                          ^
                          |
                         204800 is illegal here

You cannot request 204800 bytes of shared memory dynamically (or in any other way). The maximum is slightly less than 48 KB.
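You can check the actual per-block limit on your device at runtime. This is a small sketch using the standard runtime API; `sharedMemPerBlock` typically reports 49152 bytes (48 KB) on devices of that era, far below the 204800 bytes requested above:

```cuda
#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);  // properties of device 0

    // The hard ceiling for any one block's shared memory request:
    printf("Shared memory per block: %zu bytes\n", prop.sharedMemPerBlock);
    return 0;
}
```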

If you had done proper CUDA error checking, you would have discovered that your kernel is not running, and you would have gotten an instructive error message indicating that the launch configuration (the numbers between the <<< ... >>>) is invalid. Shared memory is requested on a per-block basis, so it's probably not sensible to request enough shared memory to cover your entire 2D data set when each block consists of only a 4x4 thread array. You probably just need enough data for what will be accessed by each 4x4 thread array.

After you have properly instrumented your code with CUDA error checking, and detected and corrected all the errors, run your code with cuda-memcheck. This will do an additional level of error checking to point out any kernel access errors. You may also use cuda-memcheck if you are getting an unspecified launch failure, and it may help pinpoint the issue.
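A minimal version of such error checking might look like the sketch below (the macro name `gpuErrchk` is illustrative); it wraps each runtime call, and a launch is checked by querying the last error and then synchronizing:

```cuda
#include <cstdio>
#include <cstdlib>
#include <cuda_runtime.h>

// Wrap every CUDA runtime call; print the error string and abort on failure.
#define gpuErrchk(call)                                             \
    do {                                                            \
        cudaError_t err = (call);                                   \
        if (err != cudaSuccess) {                                   \
            fprintf(stderr, "CUDA error: %s at %s:%d\n",            \
                    cudaGetErrorString(err), __FILE__, __LINE__);   \
            exit(EXIT_FAILURE);                                     \
        }                                                           \
    } while (0)

// Usage sketch:
//   gpuErrchk(cudaMalloc((void**)&d_a, num_bytes));
//   reduce<<<grid, block, sizeofSharedMemory>>>(d_a, d_o, block.x, block.y);
//   gpuErrchk(cudaPeekAtLastError());      // catches invalid launch configuration
//   gpuErrchk(cudaDeviceSynchronize());    // catches errors during execution
```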

After you have done these basic troubleshooting steps, it might make sense to ask others for help. But use the power of the tools you have been given first.

I also want to point out one other error before you come back and post this code again, asking for help.

This will not be useful:

std::cout << "The sum is:" << *h_o << std::endl;

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost );

You are printing out the sum before you have copied the sum from the device to the host. Reverse the order of these steps:

cudaMemcpy( h_a, d_a, num_bytes, cudaMemcpyDeviceToHost );
cudaMemcpy( h_o, d_o, sizeof(int), cudaMemcpyDeviceToHost );

std::cout << "The sum is:" << *h_o << std::endl;
Robert Crovella