I'm writing a program to convert an RGBA image to greyscale. I've worked on this a lot and have implemented the kernel correctly. However, the grid size may be wrong, even though it looks correct by my logic.
The kernel:
__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    // Global 2D coordinates of this thread's pixel
    int x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int y = (blockIdx.y * blockDim.y) + threadIdx.y;
    // Threads that fall outside the image do nothing
    if (x >= numCols || y >= numRows)
        return;
    // Row-major flat index of pixel (x, y)
    int idx = y * numCols + x;
    uchar4 rgba = rgbaImage[idx];
    float channelSum = 0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z;
    greyImage[idx] = channelSum;
}
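For a row-major image buffer, pixel (x, y) lives at y * numCols + x. An index of plain x + y would send many different threads to the same element; for example, (1, 0) and (0, 1) would both map to element 1. The sketch below is a small CPU reference that mirrors the kernel's per-pixel work, which can be handy for checking the GPU output. `Rgba`, `flatIndex` and `greyscaleReference` are hypothetical names introduced here, and `Rgba` is only a host-side stand-in for CUDA's `uchar4`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical host-side stand-in for CUDA's uchar4 (x = R, y = G, z = B, w = A).
struct Rgba {
    unsigned char x, y, z, w;
};

// Row-major flat index: pixel (x, y) is stored at y * numCols + x.
inline std::size_t flatIndex(int x, int y, int numCols) {
    return static_cast<std::size_t>(y) * numCols + x;
}

// CPU reference: one loop iteration does what one kernel thread does.
std::vector<unsigned char> greyscaleReference(const std::vector<Rgba>& rgba,
                                              int numRows, int numCols) {
    assert(rgba.size() == static_cast<std::size_t>(numRows) * numCols);
    std::vector<unsigned char> grey(rgba.size());
    for (int y = 0; y < numRows; ++y) {
        for (int x = 0; x < numCols; ++x) {
            const std::size_t idx = flatIndex(x, y, numCols);
            const Rgba p = rgba[idx];
            const float sum = 0.299f * p.x + 0.587f * p.y + 0.114f * p.z;
            // Same truncating float -> unsigned char conversion as the kernel
            grey[idx] = static_cast<unsigned char>(sum);
        }
    }
    return grey;
}
```

Comparing the device result element by element against `greyscaleReference` on the same input shows quickly whether every pixel was written exactly once.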
and the kernel launch:
const dim3 blockSize(10, 10, 1); //TODO
size_t gridSizeX, gridSizeY;
gridSizeX = numCols + (10 - (numCols % 10) ); //adding some number to make it multiple of 10
gridSizeY = numRows + (10 - (numRows % 10) ); //adding some number to make it multiple of 10
const dim3 gridSize( gridSizeX, gridSizeY, 1); //TODO
rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
I'm launching more threads than required and relying on the bounds check in the kernel to skip the extra ones.
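In a launch `<<<gridSize, blockSize>>>`, `gridSize` counts blocks, not threads. The launch above sets `gridSizeX` to roughly `numCols`, so it starts about `blockDim.x` times as many threads in x as needed (and likewise in y). The bounds check keeps this from writing out of range, but the extra threads are wasted. Also, `numCols + (10 - numCols % 10)` adds 10 even when `numCols` is already a multiple of 10. The usual fix is integer ceiling division by the block size. Below is a minimal host-side sketch of that arithmetic. `divUp`, `Dim3` and `gridFor` are hypothetical helpers, and `Dim3` only stands in for CUDA's `dim3`. The corresponding CUDA launch appears in the trailing comment.

```cpp
#include <cassert>

// Integer ceiling division: blocks of size b needed to cover n elements.
inline unsigned int divUp(unsigned int n, unsigned int b) {
    return (n + b - 1) / b;
}

// Hypothetical host-side stand-in for CUDA's dim3.
struct Dim3 {
    unsigned int x, y, z;
};

// Grid dimensions measured in BLOCKS: just enough to cover every pixel.
Dim3 gridFor(int numRows, int numCols, Dim3 block) {
    assert(numRows > 0 && numCols > 0);
    return Dim3{divUp(static_cast<unsigned int>(numCols), block.x),
                divUp(static_cast<unsigned int>(numRows), block.y),
                1};
}

// Equivalent launch in the .cu file (sketch):
//   const dim3 blockSize(10, 10, 1);
//   const dim3 gridSize((numCols + blockSize.x - 1) / blockSize.x,
//                       (numRows + blockSize.y - 1) / blockSize.y, 1);
//   rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage,
//                                              numRows, numCols);
```

With ceiling division, at most `blockDim - 1` extra threads per dimension land past the edge of the image, and the kernel's bounds check discards exactly those threads.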