Finding min and max by the cuda kernel reduction

Question

Here is my kernel code

typedef unsigned char Npp8u;
...
    // Kernel Implementation
__device__ unsigned int min_device;
__device__ unsigned int max_device;


__global__ void findMax_Min(Npp8u * data, int numEl){
    int index = blockDim.x*blockIdx.x + threadIdx.x;
    int shared_index = threadIdx.x;

    __shared__ Npp8u data_shared_min[BLOCKDIM];
    __shared__ Npp8u data_shared_max[BLOCKDIM];

    // check index condition
    if(index < numEl){
        data_shared_min[shared_index] = data[index]; //pass values from global to shared memory
        __syncthreads();
        data_shared_max[shared_index] = data[index]; //pass values from global to shared memory


        for (unsigned int stride = BLOCKDIM/2; stride > 0; stride >>= 1) {
            if(threadIdx.x <  stride){
                if(data_shared_max[threadIdx.x] <  data_shared_max[threadIdx.x+stride]) data_shared_max[shared_index] = data_shared_max[shared_index+stride];
                if(data_shared_min[threadIdx.x]>  data_shared_min[threadIdx.x+stride]) data_shared_min[shared_index] = data_shared_min[shared_index+stride];
            }
            __syncthreads();
        }
        if(threadIdx.x == 0  ){
            atomicMin(&(min_device), (unsigned int)data_shared_min[threadIdx.x ]);
            //min_device =10;
            __syncthreads();
            atomicMax(&(max_device), (unsigned int)data_shared_max[threadIdx.x ]);
        }
    }else{
        data_shared_min[shared_index] = 9999;
    }
}

I have an image that is 512x512 and I want to find the min and max pixel values. data is the 1-D version of the image. This code works for max but not for min value. As I checked from matlab max value is 202 and min value is 10 but it finds 0 for the min value. Here is my kernel codes and memcpy calls

int main(){
    // Host parameter declarations.
    Npp8u * imageHost;
    int   nWidth, nHeight, nMaxGray;

    // Load image to the host.
    std::cout << "Load PGM file." << std::endl;
    imageHost = LoadPGM("lena_before.pgm", nWidth, nHeight, nMaxGray);

    // Device parameter declarations.
    Npp8u    * imageDevice;
    unsigned int   max, min;
    size_t size = sizeof(Npp8u)*nWidth*nHeight;

    cudaMalloc((Npp8u**)&imageDevice, size);

    cudaMemcpy(imageDevice, imageHost, size, cudaMemcpyHostToDevice);

    int numPixels = nWidth*nHeight;

    dim3 numThreads(BLOCKDIM);
    dim3 numBlocks(numPixels/BLOCKDIM + (numPixels%BLOCKDIM == 0 ? 0 : 1));

    findMax_Min<<<numBlocks, numThreads>>>(imageDevice,numPixels);
    cudaMemcpyFromSymbol(&max,max_device, sizeof(max_device), 0, cudaMemcpyDeviceToHost);
    cudaMemcpyFromSymbol(&min,min_device, sizeof(min_device), 0, cudaMemcpyDeviceToHost);


    printf("Min value for image : %i\n", min);
    printf("Max value for image : %i\n", max);
...

Another interesting thing is changing the order of cudaMemcpy just after the kernel call also causes malfunctioning and values both are read as zero. I do not see the problem. Is there anyone sees the obstructed part?

You might want to do [cuda error checking](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api). You might also want to initialize `min_device` to a large value and `max_device` to zero. There are other problems with your reduction method related to stride, but I don't think it matters for a 512x512 image. — Robert Crovella, Apr 20 '13 at 17:03
For the cudamemcpy calls at the end, you are copying 4 bytes (size of max_device) into a one-byte variable (Npp8u max) and likewise for min. So that's a problem. — Robert Crovella, Apr 20 '13 at 17:13

Robert Crovella · Answer 1 · 2013-04-20T18:27:37.677

You might want to do cuda error checking. You might also want to initialize min_device to a large value and max_device to zero. There are other problems with your reduction method related to stride (what happens in the last block of an odd size image when you add stride to threadIdx.x, it may exceed the defined image range in shared memory) , but I don't think it matters for a 512x512 image. If min_device just happened to start out at zero, all of your atomicMin operations would always leave zero there.

You can try initializing min_device and max_device like this:

__device__ unsigned int min_device = 9999;
__device__ unsigned int max_device = 0;

For the cudamemcpy calls at the end, you are copying 4 bytes (size of max_device) into a one-byte variable (Npp8u max) and likewise for min. So that's a problem. Since you're using pointers, the copy operation is definitely overwriting something that you don't intend. If the compiler stores the variables sequentially the way you have them defined, one copy operation is overwriting the other variable, which I think would explain the behavior you're seeing. If you created min and max as unsigned int quantities, I think this problem would go away.

EDIT: Since you haven't shown your actual block dimensions, it's possible that you still have a problem with your reduction. You might want to change this line:

        if(threadIdx.x <  stride){

To something like:

        if((threadIdx.x <  stride) && ((index + stride)< numEl)){

This or something like it should correct the hazard I mention in the first paragraph. I guess you're trying to account for the hazard using this line:

    data_shared_min[shared_index] = 9999;

But there's no guarantee that line of code gets executed before the data element that it is setting in shared memory gets read by some other thread. I also don't know what happens when you assign a value of 9999 to a byte quantity, but it's probably not what you expect.

for the second case you are right and changing max and min simply works but still does not work for the problem of getting 0 minimum. — erogol, Apr 20 '13 at 17:55
Maybe you should update the question with your updated code. Can you show how you initialize `min_device`? — Robert Crovella, Apr 20 '13 at 18:02
I've updated my answer with a few more comments. I see you've updated your code for to reflect `unsigned int` for `max` and `min`, but I don't see anywhere that you are initializing `min_device` and `max_device`. It seems you're not comprehending my answer. Changing `max` and `min` definitions will *not* fix the min value of zero issue. *You need to initialize both `min_device` to a large value and `max_device` to zero.* — Robert Crovella, Apr 20 '13 at 18:26

Finding min and max by the cuda kernel reduction

1 Answers1