
I have a sequential smoothing algorithm:

void triangularSmooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter, NSTimer &timer, dim3 grid_size, dim3 block_size) {
    for ( int y = 0; y < height; y++ ) {
        for ( int x = 0; x < width; x++ ) {
            unsigned int filterItem = 0;
            float filterSum = 0.0f;
            float smoothPix = 0.0f;

            // Apply the 5x5 filter centered on (x, y).
            for ( int fy = y - 2; fy < y + 3; fy++ ) {
                for ( int fx = x - 2; fx < x + 3; fx++ ) {
                    // Skip filter taps that fall outside the image.
                    if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                        filterItem++;
                        continue;
                    }

                    smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                    filterSum += filter[filterItem];
                    filterItem++;
                }
            }

            // Normalize by the sum of the taps actually used.
            smoothPix /= filterSum;
            smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
        }
    }
}

I am implementing this in CUDA, and I wish to use a shared variable to hold the pixels of grayImage. Before that, however, I'm trying to run it as it is. To this end I have this kernel code:

__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage, const int width, const int height, const float *filter)
{

        // Each thread computes one output pixel at (x, y).
        int x = blockIdx.x*blockDim.x + threadIdx.x;
        int y = blockIdx.y*blockDim.y + threadIdx.y;

        unsigned int filterItem = 0;
        float filterSum = 0.0f;
        float smoothPix = 0.0f;

        for ( int fy = y - 2; fy < y + 3; fy++ ) {
            for ( int fx = x - 2; fx < x + 3; fx++ ) {
                if ( ((fy < 0) || (fy >= height)) || ((fx < 0) || (fx >= width)) ) {
                    filterItem++;
                    continue;
                }

                smoothPix += grayImage[(fy * width) + fx] * filter[filterItem];
                filterSum += filter[filterItem];
                filterItem++;
            }
        }
        smoothPix /= filterSum;
        smoothImage[(y * width) + x] = static_cast< unsigned char >(smoothPix);
}

And calling with:

const float filter[] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
                        1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                        1.0f, 2.0f, 3.0f, 2.0f, 1.0f,
                        1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                        1.0f, 1.0f, 1.0f, 1.0f, 1.0f};
dim3 gridSize((width*height)/1024,(width*height)/1024,1);
dim3 blockSize(256,256,1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);
cudaDeviceSynchronize();

The problem is that the resulting smoothed image looks like the pixels are all in the wrong order (mixed up). Is this caused by the dimensions of the grid and block? I've tried a LOT of other possible dimensions. What would be the right way?

I'm using a GTX480 (compute capability 2.x): maximum dimensionality of a grid of thread blocks is 3, maximum x-, y-, or z-dimension of a grid of thread blocks is 65535, and maximum number of threads per block is 1024.

  • Your kernel is never running because the blocksize is illegal. If you add some error checking to your code you will see the kernel launch fails with an invalid configuration error. – talonmies Feb 06 '13 at 21:47
  • I have a cudaGetLastError(); immediately after cudaDeviceSynchronize(); and it returns no errors – Francis Saa-Dittoh Feb 06 '13 at 21:51
  • Please read [this question and answer](http://stackoverflow.com/questions/14038589/what-is-the-canonical-way-to-check-for-errors-using-the-cuda-runtime-api) for the correct way to check for errors during a kernel launch. Note in your question you have said your GPU has a limit of 1024 threads per block, and you are asking for 256*256*1 threads per block..... – talonmies Feb 06 '13 at 22:04
  • Was not checking properly; you are right! However, I also get an error with (16,16,1) or (32,32,1) – Francis Saa-Dittoh Feb 06 '13 at 22:04
  • Drawing from your comment, my gridsize is also invalid. (my images are 1024x1024 and above). Will correct it and get back. – Francis Saa-Dittoh Feb 06 '13 at 22:14
  • I have tried using `dim3 gridSize1((width + blockSize1.x - 1)/ blockSize1.x, (height + blockSize1.y - 1) / blockSize1.y, 1);` and blocksize (16,16,1), which should be within the limits. I however still get "_unspecified launch failure_" – Francis Saa-Dittoh Feb 06 '13 at 22:20
  • Unspecified launch failure means an out-of-bounds memory access in your kernel somewhere. Run cuda-memcheck to see where. – talonmies Feb 06 '13 at 22:21
  • "Invalid __global__ read of size 4". From multiple threads. Trying to find which variable is causing it to go out of bounds – Francis Saa-Dittoh Feb 06 '13 at 22:39
  • Got it working, thanks! It was something to do with the filter array not being passed properly. It works when the filter is declared within the kernel code. – Francis Saa-Dittoh Feb 06 '13 at 23:06
  • So that this question isn't a complete waste of everyone's time, please write your solution into an answer. You will later be able to accept that answer, which marks the question as answered – talonmies Feb 07 '13 at 02:40
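
(For reference, a minimal sketch of the launch error checking discussed in the comments above; cudaGetLastError() right after the launch catches invalid configurations, and the return value of cudaDeviceSynchronize() surfaces errors from execution:)

// Check for launch-configuration errors immediately after the launch.
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, filter);
cudaError_t err = cudaGetLastError();
if (err != cudaSuccess)
    printf("Launch error: %s\n", cudaGetErrorString(err));

// Check for errors that occur while the kernel executes.
err = cudaDeviceSynchronize();
if (err != cudaSuccess)
    printf("Kernel error: %s\n", cudaGetErrorString(err));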

2 Answers


Looking at this answer related to image filtering, I would recommend that you create the block and grid for the image like this:

dim3 blockSize(16,16,1);
dim3 gridSize((width + blockSize.x - 1)/blockSize.x,(height + blockSize.y - 1)/blockSize.y,1);

Another very common mistake is that the filter array you are passing to the kernel is allocated on the host. Create an array of the same size on the device, copy the coefficients from host to device, and pass that device array to the kernel.
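
A minimal sketch of that setup, reusing the launch configuration above (the name d_filter is illustrative):

// Hypothetical device buffer for the 5x5 filter coefficients.
float *d_filter = NULL;
cudaMalloc(&d_filter, 25 * sizeof(float));
cudaMemcpy(d_filter, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);

// Launch with the device pointer instead of the host array.
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, d_filter);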

Also, it is highly recommended to calculate the sum of the filter coefficients on the host side and pass it as an argument to the kernel, instead of recalculating the sum in every thread.
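
A sketch of that host-side sum, assuming the kernel signature is modified to take it as an extra parameter (note it is only exact for interior pixels, where all 25 taps fall inside the image):

// Hypothetical: sum the 25 coefficients once on the host.
float hostFilterSum = 0.0f;
for (int i = 0; i < 25; i++) {
    hostFilterSum += filter[i];
}

// Assumes a modified kernel signature with a trailing float parameter.
smooth<<<gridSize, blockSize>>>(grayImage, smoothImage, width, height, d_filter, hostFilterSum);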

The boundary conditions may cause out-of-range memory accesses. Handle the boundary conditions explicitly in the kernel, or, as an easier approach, use a CUDA texture for the input image so that boundary conditions are handled automatically.
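
Independently of the filter taps, the rounded-up grid launches more threads than there are pixels, so the kernel also needs a guard like this sketch before any reads or writes:

int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;

// Threads that fall outside the image must do nothing at all.
if (x >= width || y >= height)
    return;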

– T.Z

First, the dimensions are invalid. The following should work in this case:

dim3 blockSize(16, 16, 1);
dim3 gridSize((width + blockSize.x - 1)/ blockSize.x, (height + blockSize.y - 1) / blockSize.y, 1);
smooth <<< gridSize, blockSize >>> (grayImage, smoothImage, width, height, filter);

After the correction, running cuda-memcheck yielded results similar to:

========= Invalid __global__ read of size 4
=========     at 0x00000120 in cudaFilter
=========     by thread (4,1,0) in block (1,0,0)
=========     Address 0x05100190 is out of bounds

This shows that a read within the kernel code is out of bounds (most likely an array index). Checking the various variables revealed that filter[] was not being passed to the kernel properly.

Lastly, if filter[] is to be passed to the kernel, it must first be allocated on the GPU and the coefficients copied from CPU to GPU, using something like

cudaMemcpy(filterGpu, filter, 25 * sizeof(float), cudaMemcpyHostToDevice);
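
For this to work, filterGpu needs a prior device allocation; a sketch:

// Hypothetical allocation; must happen before the copy above.
float *filterGpu = NULL;
cudaMalloc(&filterGpu, 25 * sizeof(float));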

Alternatively, if the filter is not needed anywhere else (as is the case here), it can be declared within the kernel instead.
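
A sketch of that alternative, with the coefficients hard-coded inside the kernel so that no filter argument or host-to-device copy is needed:

__global__ void smooth(unsigned char *grayImage, unsigned char *smoothImage,
                       const int width, const int height)
{
    // 5x5 triangular filter declared in the kernel itself.
    const float filter[25] = {1.0f, 1.0f, 1.0f, 1.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 3.0f, 2.0f, 1.0f,
                              1.0f, 2.0f, 2.0f, 2.0f, 1.0f,
                              1.0f, 1.0f, 1.0f, 1.0f, 1.0f};

    int x = blockIdx.x * blockDim.x + threadIdx.x;
    int y = blockIdx.y * blockDim.y + threadIdx.y;

    // ... same smoothing loop as in the question, indexing filter[filterItem] ...
}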