I am new to CUDA programming. I have code that converts an RGB image to grayscale; the algorithm for reading the RGB values of each pixel and converting them to grayscale was provided to us. Parallelizing the code has given me around a 40-50x speedup, and I want to optimize it further to reach roughly 100x. For that I want to use shared memory, since it is an order of magnitude faster than global memory. I have gone through different online resources and have a basic understanding of shared memory, but I am having trouble seeing how to apply it in my code. The code that reads the RGB values and converts them to grayscale:
for ( int y = 0; y < height; y++ ) {
    for ( int x = 0; x < width; x++ ) {
        float grayPix = 0.0f;
        float r = static_cast< float >(inputImage[(y * width) + x]);
        float g = static_cast< float >(inputImage[(width * height) + (y * width) + x]);
        float b = static_cast< float >(inputImage[(2 * width * height) + (y * width) + x]);
        grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
        grayPix = (grayPix * 0.6f) + 0.5f;
        darkGrayImage[(y * width) + x] = static_cast< unsigned char >(grayPix);
    }
}
The input image is an unsigned char * and we are using the CImg library to manipulate the image:
CImg< unsigned char > inputImage = CImg< unsigned char >(argv[1]);
where the user passes the path to the image as an argument when running the program.
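In case it matters, the host-side setup is roughly like this (a sketch, not my exact code; d_in and d_out are the device buffers used in the kernel launch further down). CImg stores the channels as separate planes (all R values, then all G, then all B), which is why the indexing above offsets each plane by width * height.

unsigned char *d_in, *d_out;
size_t inSize  = 3 * width * height * sizeof(unsigned char); // R, G and B planes
size_t outSize = width * height * sizeof(unsigned char);     // single grey plane

cudaMalloc((void **)&d_in, inSize);
cudaMalloc((void **)&d_out, outSize);
cudaMemcpy(d_in, inputImage.data(), inSize, cudaMemcpyHostToDevice);

// ... kernel launch (shown further down) ...

cudaMemcpy(darkGrayImage, d_out, outSize, cudaMemcpyDeviceToHost); // darkGrayImage: host output buffer
cudaFree(d_in);
cudaFree(d_out);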
This is my CUDA implementation of it:
__global__ void greyScale(int width, int height, unsigned char *inputImage, unsigned char *darkGrayImage) {
    unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
    unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
    float grayPix = 0.0f;
    float r = static_cast< float >(inputImage[(y * height) + x]);
    float g = static_cast< float >(inputImage[(width * height) + (y * height) + x]);
    float b = static_cast< float >(inputImage[(2 * width * height) + (y * height) + x]);
    grayPix = ((0.3f * r) + (0.59f * g) + (0.11f * b));
    grayPix = (grayPix * 0.6f) + 0.5f;
    darkGrayImage[(y * height) + x] = static_cast< unsigned char >(grayPix);
}
The grid and block dimensions and the kernel launch:
dim3 gridSize(width/16,height/16);
dim3 blockSize(16,16);
greyScale<<< gridSize, blockSize >>>(width,height,d_in, d_out);
where width and height are the width and height of the input image. I tried a block size of (32,32), but that slowed the code down instead of speeding it up.
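(A side note: I realize width/16 and height/16 truncate, so if the image dimensions were not multiples of 16 the last rows and columns would be skipped. My understanding is that a guarded version, keeping my current mapping of blockIdx.x to y and blockIdx.y to x, would look roughly like the sketch below, but that is not the main thing I am asking about.)

dim3 blockSize(16, 16);
dim3 gridSize((width  + blockSize.x - 1) / blockSize.x,
              (height + blockSize.y - 1) / blockSize.y);
greyScale<<< gridSize, blockSize >>>(width, height, d_in, d_out);

// and at the top of the kernel:
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
if (y >= width || x >= height) return;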
Now I want to add shared memory, but the problem is that access to the input variable inputImage is quite non-linear, so I do not know which values to load into shared memory. I tried something like
unsigned int y = (blockIdx.x * blockDim.x) + threadIdx.x;
unsigned int x = (blockIdx.y * blockDim.y) + threadIdx.y;
extern __shared__ int s[];
s[x]=inputImage[x];
__syncthreads();
and then replaced inputImage with s in the implementation, but that just gave wrong output (an all-black image). Can you help me understand how to implement shared memory here, whether it is even possible and useful, and whether there is a way to make my accesses more coalesced?
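(For reference, my understanding is that an extern __shared__ array also needs its size in bytes passed as the third launch parameter, and that each thread should index it by its position within the block rather than by a global coordinate, roughly as sketched below. Even then I am not sure what the right data to stage in it would be, which is really my question.)

// Sketch only: the dynamic shared memory size per block goes in the launch
size_t sharedBytes = blockSize.x * blockSize.y * sizeof(unsigned char);
greyScale<<< gridSize, blockSize, sharedBytes >>>(width, height, d_in, d_out);

// inside the kernel, indexing by the thread's position in its block:
extern __shared__ unsigned char s[];
unsigned int tid = (threadIdx.y * blockDim.x) + threadIdx.x; // 0..255 for a 16x16 block
s[tid] = inputImage[(y * height) + x];                       // e.g. each thread stages its own red value
__syncthreads();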
Any help would be greatly appreciated.