Converting a colour image to greyscale using CUDA parallel processing

Question

I am trying to solve a problem in which I am supposed to convert a colour image to a greyscale image. For this purpose I am using a CUDA parallel approach.

The kernel code I am invoking on the GPU is as follows.

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                   unsigned char* const greyImage,
                   int numRows, int numCols)
{
    int absolute_image_position_x = blockIdx.x;
    int absolute_image_position_y = blockIdx.y;

    if (absolute_image_position_x >= numCols ||
        absolute_image_position_y >= numRows)
    {
        return;
    }

    uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
    float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
    greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage,
                            uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage,
                            size_t numRows,
                            size_t numCols)
{
  //You must fill in the correct sizes for the blockSize and gridSize
  //currently only one block with one thread is being launched
  const dim3 blockSize(numCols/32, numCols/32 , 1);  //TODO
  const dim3 gridSize(numRows/12, numRows/12 , 1);  //TODO
  rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage,
                                             d_greyImage,
                                             numRows,
                                             numCols);

  cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

I see a line of dots in the first row of pixels.

The error I am getting is:

libdc1394 error: Failed to initialize libdc1394
Difference at pos 51 exceeds tolerance of 5
Reference: 255
GPU : 0
I have linked my input/output images. Can anyone help me with this? Thanks in advance.

Please give your question a more meaningful title. As it stands it means absolutely nothing to anyone but you. How would someone with a similar image processing question *ever* find this by searching? — talonmies, Feb 05 '13 at 16:41
This is an assignment from the "Introduction to Parallel Programming" course on Udacity. You should solve it yourself and not use Stack Overflow to get it solved for you by others. — RoBiK, Feb 05 '13 at 22:07
@RoBiK: I was just curious and was simultaneously trying it myself. As far as "getting it solved for you by others" is concerned, my aim is not to submit the answer on Udacity and make it count for grades; it has more to do with discussing with others in the programming community and learning from their expertise. Hope that makes sense to you. — Ashish Singh, Feb 06 '13 at 04:13

score 6 · Answer 1 · answered Oct 17 '15 at 14:46

I recently joined this course and tried your solution, but it doesn't work, so I tried my own. You are almost correct. The correct solution is this:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int pos_x = (blockIdx.x * blockDim.x) + threadIdx.x;
    int pos_y = (blockIdx.y * blockDim.y) + threadIdx.y;
    if (pos_x >= numCols || pos_y >= numRows)
        return;

    uchar4 rgba = rgbaImage[pos_x + pos_y * numCols];
    greyImage[pos_x + pos_y * numCols] = (.299f * rgba.x + .587f * rgba.y + .114f * rgba.z);
}

The rest is the same as your code.
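For checking a kernel like this one, a serial reference can help. The following is only a minimal host-side sketch in plain C++, not the course's grader: `Pixel` is a hypothetical stand-in for CUDA's `uchar4`, and `to_grey` and `grey_reference` are names made up for illustration. It uses the same weights and the same row-major index as the kernel above.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical stand-in for CUDA's uchar4 so the sketch compiles without nvcc.
struct Pixel { std::uint8_t x, y, z, w; };

// Same weights as the kernel; the float result is truncated on conversion.
inline std::uint8_t to_grey(Pixel p) {
    float sum = .299f * p.x + .587f * p.y + .114f * p.z;
    return static_cast<std::uint8_t>(sum);
}

// Serial reference: visits every pixel once, in row-major order.
std::vector<std::uint8_t> grey_reference(const std::vector<Pixel>& rgba,
                                         std::size_t numRows,
                                         std::size_t numCols) {
    std::vector<std::uint8_t> grey(numRows * numCols);
    for (std::size_t y = 0; y < numRows; ++y)
        for (std::size_t x = 0; x < numCols; ++x)
            grey[x + y * numCols] = to_grey(rgba[x + y * numCols]);
    return grey;
}
```

Comparing the GPU output against `grey_reference` pixel by pixel should expose indexing mistakes such as a missing `* numCols`.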

nevermind: this answered my question https://stackoverflow.com/questions/2151084/map-a-2d-array-onto-a-1d-array-c — labheshr, Sep 03 '17 at 16:06

Ashish Singh · Accepted Answer · 2014-04-16T12:36:06.127

Since I posted this question I have been working on this problem continuously. I now realize that my initial solution was wrong, and a couple of improvements are needed to get it right.
Changes to be made:

 1. absolute_position_x =(blockIdx.x * blockDim.x) + threadIdx.x;
 2. absolute_position_y = (blockIdx.y * blockDim.y) + threadIdx.y;

Secondly,

 1. const dim3 blockSize(24, 24, 1);
 2. const dim3 gridSize((numCols/16), (numRows/16) , 1);

This solution uses a grid of numCols/16 × numRows/16 blocks
and a block size of 24 × 24 threads.

Code executed in 0.040576 ms.
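Whether a launch configuration like this reaches every pixel can be checked with a little arithmetic. The sketch below is plain host-side C++ with hypothetical helper names (`threads_along_axis`, `covers`); it only compares the number of threads launched along one axis with the image length along that axis.

```cpp
#include <cassert>
#include <cstddef>

// Threads launched along one axis: number of blocks times threads per block.
inline std::size_t threads_along_axis(std::size_t blocks, std::size_t block_dim) {
    return blocks * block_dim;
}

// True when the launch reaches every pixel along an axis of the given length.
inline bool covers(std::size_t blocks, std::size_t block_dim, std::size_t length) {
    assert(block_dim > 0);
    return threads_along_axis(blocks, block_dim) >= length;
}
```

For a hypothetical 100-pixel-wide image, 100/16 = 6 blocks of 24 threads give 144 threads, which is enough, whereas 6 blocks of 16 threads give only 96. With integer division, (n/16) * 24 is at least n for any n of 32 or more, so the configuration over-provisions threads and relies on the bounds check in the kernel to discard the extras.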

@datenwolf : thanks for answering above!!!

any idea why the blockSize needs to be `24,24` and gridSize `numCols/16, numRows/16`? Is there a reason why? Can other number work? — alvas, Jul 29 '15 at 09:55

score 2 · Answer 3 · edited Sep 06 '16 at 19:36

Since you do not know the image size in advance, it is best to choose any reasonable dimensions for the two-dimensional block of threads and then check two conditions. First, the pos_x and pos_y indexes in the kernel must not exceed numCols and numRows respectively. Second, the grid should be just large enough that the threads in all the blocks together cover the image along each dimension.

const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols%16) ? numCols/16+1 : numCols/16,
(numRows%16) ? numRows/16+1 : numRows/16, 1);
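The rounding-up in the grid size above is often written as a ceiling division. A minimal sketch in host-side C++, where `ceil_div` is a hypothetical helper name and not part of the CUDA API:

```cpp
#include <cassert>
#include <cstddef>

// Smallest number of blocks of size `block` whose threads cover `n` elements.
// Equivalent to (n % block) ? n / block + 1 : n / block.
inline std::size_t ceil_div(std::size_t n, std::size_t block) {
    assert(block > 0);
    return (n + block - 1) / block;
}
```

Under that assumption the grid would be `gridSize(ceil_div(numCols, 16), ceil_div(numRows, 16), 1)`. The simpler `n / block + 1` form seen elsewhere also covers the image, but launches one extra block per axis when `n` is an exact multiple of the block size.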

score 1 · Answer 4 · answered Feb 05 '13 at 16:07


libdc1394 error: Failed to initialize libdc1394

I don't think that this is a CUDA problem. libdc1394 is a library used to access IEEE 1394 (also known as FireWire or i.LINK) video devices such as DV camcorders and the Apple iSight camera. That library doesn't initialize properly, hence you're not getting useful results. Basically it's NINO: Nonsense In, Nonsense Out.

answered Feb 05 '13 at 16:07 by datenwolf

@datenwolf please see, I have added a link to the input/output images I am getting. – Ashish Singh Feb 05 '13 at 16:14
What I see is that the error at pos 51 exceeds the tolerance of 5, so I am guessing it is related to the colour pattern and not to any linker-type error. – Ashish Singh Feb 05 '13 at 16:19
@ashish173: It's not a linker problem, it's a runtime problem. The dc1394 library fails to initialize properly upon program startup and will likely produce only garbage when used to retrieve pictures. You must first fix that initialization problem (this is a runtime thing, i.e. something you must code). – datenwolf Feb 05 '13 at 17:26

score 1 · Answer 5 · answered May 30 '13 at 04:58

The calculation of the absolute x and y image positions is perfect. But when you need to access that particular pixel in the coloured image, shouldn't you use the following code?

uchar4 rgba = rgbaImage[absolute_image_position_x + (absolute_image_position_y * numCols)];

I thought so, when comparing it to a code you'd write to execute the same problem in serial code. Please let me know :)

score 1 · Answer 6 · answered Oct 14 '13 at 06:50

You will still have a problem at run time: the conversion will not give a proper result.

The lines:

uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y];
greyImage[absolute_image_position_x + absolute_image_position_y] = channelSum;

should be changed to:

uchar4 rgba = rgbaImage[absolute_image_position_x + absolute_image_position_y*numCols];
greyImage[absolute_image_position_x + absolute_image_position_y*numCols] = channelSum;
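To see why the `* numCols` term matters, here is a small host-side C++ sketch. The function names `flat_index` and `broken_index` are made up for illustration. Without the multiplication, different pixels collapse onto the same array element.

```cpp
#include <cstddef>

// Row-major offset of pixel (x, y) in an image that is numCols wide.
inline std::size_t flat_index(std::size_t x, std::size_t y, std::size_t numCols) {
    return x + y * numCols;
}

// The index used in the question: distinct pixels can share the same value.
inline std::size_t broken_index(std::size_t x, std::size_t y) {
    return x + y;
}
```

With the broken form, pixels (1, 0) and (0, 1) both map to element 1, so only the first numCols + numRows - 1 elements of the output are ever written. That would be consistent with the asker seeing output only near the start of the first pixel row.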

score 1 · Answer 7 · answered Mar 10 '14 at 07:15

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    int rgba_x = blockIdx.x * blockDim.x + threadIdx.x;
    int rgba_y = blockIdx.y * blockDim.y + threadIdx.y;

    // The grid is rounded up, so threads past the image edge must do nothing.
    if (rgba_x >= numCols || rgba_y >= numRows)
        return;

    int pixel_pos = rgba_x + rgba_y * numCols;

    uchar4 rgba = rgbaImage[pixel_pos];
    unsigned char gray = (unsigned char)(0.299f * rgba.x + 0.587f * rgba.y + 0.114f * rgba.z);
    greyImage[pixel_pos] = gray;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    //You must fill in the correct sizes for the blockSize and gridSize
    //currently only one block with one thread is being launched
    const dim3 blockSize(24, 24, 1);  //TODO
    const dim3 gridSize( numCols/24+1, numRows/24+1, 1);  //TODO
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);

    cudaDeviceSynchronize(); checkCudaErrors(cudaGetLastError());
}

Although you may get the right answer, you do this in a very weird way. You pass in columns where rows need to be passed into your gridSize, and your formula for pixel_pos does not tie in with the standard way of flattening a 2D array into a 1D array. It should be either numRows*y + x or numCols*x + y, but it all works out because your grid is set to (cols, rows) instead of (rows, cols). — labheshr, Sep 03 '17 at 16:05

score 1 · Answer 8 · answered Jul 12 '15 at 05:17

The libdc1394 error is not related to FireWire etc. in this case; it is the library that Udacity is using to compare the image your program creates to the reference image. What it is saying is that the difference between your image and the reference image has exceeded a specific threshold at that position, i.e. pixel.
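The "Difference at pos 51 exceeds tolerance of 5" message suggests a per-pixel comparison against a reference image. The course's actual checker is not shown in this thread, so the following host-side C++ is only a guess at the general idea, with a hypothetical function name `first_mismatch`.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <vector>

// Returns the first position where the two images differ by more than
// `tolerance`, or -1 when no such position exists in the compared range.
// Only the common prefix is compared if the sizes differ.
std::ptrdiff_t first_mismatch(const std::vector<std::uint8_t>& reference,
                              const std::vector<std::uint8_t>& gpu,
                              int tolerance) {
    const std::size_t n =
        reference.size() < gpu.size() ? reference.size() : gpu.size();
    for (std::size_t i = 0; i < n; ++i) {
        const int diff = std::abs(static_cast<int>(reference[i]) -
                                  static_cast<int>(gpu[i]));
        if (diff > tolerance)
            return static_cast<std::ptrdiff_t>(i);
    }
    return -1;
}
```

A reported pair such as "Reference: 255, GPU: 0" would then mean that the kernel never wrote that output element, not that the colour weights are slightly off.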

score 0 · Answer 9 · answered Feb 06 '13 at 00:20

You are launching the following block and grid sizes:

  const dim3 blockSize(numCols/32, numCols/32 , 1);  //TODO
  const dim3 gridSize(numRows/12, numRows/12 , 1);  //TODO

yet you are not using any threads in your kernel code!

 int absolute_image_position_x = blockIdx.x;  
 int absolute_image_position_y = blockIdx.y;

Think of it this way: the width of the image is divided into column-wise parts indexed by absolute_image_position_x, and the height is divided into row-wise parts indexed by absolute_image_position_y. Within each box that this cross-section creates, you need to convert all the pixels to greyImage in parallel. Enough spoilers for an assignment :)
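The point about unused threads can be made concrete with a small host-side C++ sketch that mimics how a thread's coordinate along one axis is usually derived. The function names are hypothetical, and this is ordinary C++, not device code.

```cpp
#include <cstddef>

// Global coordinate of a thread along one axis, as CUDA kernels commonly
// compute it from blockIdx, blockDim and threadIdx.
inline std::size_t global_index(std::size_t block_idx, std::size_t block_dim,
                                std::size_t thread_idx) {
    return block_idx * block_dim + thread_idx;
}

// Distinct coordinates reached when the kernel uses only blockIdx:
// every thread of a block collapses onto the same coordinate.
inline std::size_t reach_with_block_only(std::size_t grid_dim) {
    return grid_dim;
}

// Distinct coordinates reached when threadIdx is included as well.
inline std::size_t reach_with_threads(std::size_t grid_dim, std::size_t block_dim) {
    return grid_dim * block_dim;
}
```

With, say, 4 blocks of 16 threads along an axis, using only the block index touches 4 coordinates, while the full formula touches 64.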

Thanks for answering. I've figured it out: I wasn't using any threads, which was so stupid of me. — Ashish Singh, Feb 06 '13 at 04:06

score 0 · Answer 10 · answered Jun 25 '15 at 22:39

Same code, with the ability to handle non-standard input image sizes:

__global__
void rgba_to_greyscale(const uchar4* const rgbaImage,
                       unsigned char* const greyImage,
                       int numRows, int numCols)
{
    // With the grid below, x runs over rows and y runs over columns.
    int idx = blockDim.x * blockIdx.x + threadIdx.x;
    int idy = blockDim.y * blockIdx.y + threadIdx.y;

    // The grid is rounded up, so skip threads that fall outside the image.
    if (idx >= numRows || idy >= numCols)
        return;

    uchar4 rgbcell = rgbaImage[idx * numCols + idy];
    greyImage[idx * numCols + idy] = 0.299 * rgbcell.x + 0.587 * rgbcell.y + 0.114 * rgbcell.z;
}

void your_rgba_to_greyscale(const uchar4 * const h_rgbaImage, uchar4 * const d_rgbaImage,
                            unsigned char* const d_greyImage, size_t numRows, size_t numCols)
{
    // Pick the largest candidate block edge that divides the pixel count.
    int totalpixels = numRows * numCols;
    int factors[] = {2, 4, 8, 16, 24, 32};
    std::vector<int> numbers(factors, factors + sizeof(factors) / sizeof(int));  // needs <vector>
    int factor = 1;

    while (!numbers.empty())
    {
        if (totalpixels % numbers.back() == 0)
        {
            factor = numbers.back();
            break;
        }
        numbers.pop_back();
    }

    const dim3 blockSize(factor, factor, 1);
    const dim3 gridSize(numRows / factor + 1, numCols / factor + 1, 1);
    rgba_to_greyscale<<<gridSize, blockSize>>>(d_rgbaImage, d_greyImage, numRows, numCols);
}

score 0 · Answer 11 · edited Feb 09 '17 at 16:23


1- int x =(blockIdx.x * blockDim.x) + threadIdx.x;

2- int y = (blockIdx.y * blockDim.y) + threadIdx.y;

And in grid and block size

1- const dim3 blockSize(32, 32, 1);

2- const dim3 gridSize((numCols/32+1), (numRows/32+1) , 1);

Code executed in 0.036992 ms.

edited Feb 09 '17 at 16:23 by Racil Hilan · answered Feb 09 '17 at 15:34 by Ankur Singh

score 0 · Answer 12 · answered Jul 19 '17 at 03:30

// Host side: round the grid up so it covers the whole image.
const dim3 blockSize(16, 16, 1);
const dim3 gridSize((numCols + 15) / 16, (numRows + 15) / 16, 1);

// Inside the kernel: x indexes columns, y indexes rows.
int x = blockIdx.x * blockDim.x + threadIdx.x;
int y = blockIdx.y * blockDim.y + threadIdx.y;
if (x >= numCols || y >= numRows)
    return;

uchar4 rgba = rgbaImage[y * numCols + x];
float channelSum = .299f * rgba.x + .587f * rgba.y + .114f * rgba.z;
greyImage[y * numCols + x] = channelSum;
