
I'm new to CUDA and this is my first project with it. I'm trying to push the image data to the GPU, convert it to black and white there, and write it to a new image. But the program gives me a black image instead of a black-and-white version. What am I doing wrong? The width and height of the image are 3840x2160.

Source image

Output image

__global__ void addMatrix(unsigned char *DataOut, unsigned char *DataIn)
{
    int idx = threadIdx.x;
    DataOut[idx] = (DataIn[idx] + DataIn[idx + 1] + DataIn[idx + 2]) / 3;
    //
}
int main()
{
int iWidth, iHeight, iBpp, iHeightOut, iWidthOut;

vector<unsigned char> pDataIn;
vector<unsigned char> pDataOut;


int error1 = LoadBmpFile(L"3840x2160.bmp", iWidth, iHeight, iBpp, pDataIn);

if (error1 != 0 || pDataIn.size() == 0 || iBpp != 32)
{
    std::cout << "error loading input file!\n";
}


pDataOut.resize(pDataIn.size()/4);

unsigned int SizeIn, SizeOut;
unsigned char *devDatOut, *devDatIn, *PInData, *POutData;

int i = 0;
SizeIn = pDataIn.size();
SizeOut = pDataOut.size();
PInData = pDataIn.data();
POutData = pDataOut.data();

i = cudaMalloc((void**)&devDatIn, SizeIn * sizeof(unsigned char));
if(i != 0)
{
printf("cudaMalloc __e FAIL! Code: %d\n", i);
_getch();
}
i = cudaMalloc((void**)&devDatOut, SizeOut * sizeof(unsigned char));
if(i != cudaSuccess)
printf("cudaMalloc __e FAIL! Code: %d\n", i);

i = cudaMemcpy(devDatIn, PInData, SizeIn * sizeof(unsigned char), cudaMemcpyHostToDevice);
if(i != cudaSuccess)
printf(" cudaMemcpy __e FAIL! Code: %d\n", i);
i = cudaMemcpy(devDatOut, POutData, SizeOut * sizeof(unsigned char), cudaMemcpyHostToDevice);
if(i != cudaSuccess)
printf(" cudaMemcpy __e FAIL! Code: %d\n", i);

dim3 gridSize = dim3(1, 1, 1);   
dim3 blockSize = dim3(SizeIn, 1, 1);

addMatrix<<<gridSize, blockSize>>>(devDatIn, devDatOut);
if ( i == cudaGetLastError() )
{
printf( "Error! %d\n", cudaGetLastError() );
_getch;
}

cudaEvent_t syncEvent;

cudaEventCreate(&syncEvent);
cudaEventRecord(syncEvent, 0);
cudaEventSynchronize(syncEvent);

cudaMemcpy(POutData, devDatOut, SizeOut * sizeof(unsigned char), cudaMemcpyDeviceToHost);

cudaEventDestroy(syncEvent);

i = WriteBmpFile(L"3840x2160_test2.bmp", iWidth, iHeight, 8, pDataOut.size(), pDataOut.data(), false);
if(i != 0)
    printf("WriteBmpFile FAIL! Code: %d\n", i);

cudaFree(devDatOut);
cudaFree(devDatIn);

}

EDIT 1:

Output image after editing

  • presumably `SizeIn` is something like 3840x2160 = approximately 8 million. You can't use a CUDA block size of this: `dim3 blockSize = dim3(SizeIn, 1, 1);` when `SizeIn` is larger than 1024. Furthermore, the `cudaGetLastError()` call immediately after the kernel launch should be returning an error due to that. Your kernel would have to be adjusted also, and your kernel indexing into the data is not correct either - it needs to stride by 3 per thread. – Robert Crovella Apr 13 '16 at 12:22
  • So I need to use dim3 blockSize = dim3(1024, 1, 1), right? After that cudaGetLastError() doesn't return anything, but the program still gives me a black image. I just don't get what's wrong with the kernel part. Can you please modify my code to make it work? – Generwp Apr 13 '16 at 12:39

1 Answer


As seen in the comments, you launch too many threads per block, so the launch is probably failing, but your error checking is incomplete and you can't see it. See this answer for proper error checking after launching a kernel.
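
A minimal sketch of that pattern, checking both the launch error and the execution error (the kernel name and launch configuration here are just placeholders):

// Check two kinds of errors after a launch:
// 1) launch/configuration errors, reported immediately by cudaGetLastError()
// 2) execution errors, surfaced when synchronizing with the device
addMatrix<<<gridSize, blockSize>>>(devDatOut, devDatIn);

cudaError_t err = cudaGetLastError();   // e.g. an invalid block size shows up here
if (err != cudaSuccess)
    printf("Kernel launch failed: %s\n", cudaGetErrorString(err));

err = cudaDeviceSynchronize();          // wait for the kernel and catch runtime errors
if (err != cudaSuccess)
    printf("Kernel execution failed: %s\n", cudaGetErrorString(err));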

By the way, you have to launch several blocks with fewer threads each, instead of one block with all those threads. For example it could be 2000 blocks of 1024 threads. On my current cards 1024 threads is the maximum per block, so I launch several blocks when I need more threads, and in general I launch more blocks with fewer threads each to achieve better occupancy. You can learn more about occupancy in this presentation.

So for example, if you had 10,000 pixels in total, you could launch, say, 20 blocks of 500 threads each. You can do this in one dimension for both:

dim3 blocks(20, 1, 1);
dim3 threads(500, 1, 1);

myKernel<<<blocks, threads>>>(...);

And a small change is necessary in your kernel to correctly map the thread index in each block to a unique linear index into the picture buffer in memory (I wrote this in the browser and haven't tested it with your specific project, but the linearization is simple enough to work out on paper with a good old pen):

__global__ void addMatrix(unsigned char *DataOut, unsigned char *DataIn)
{
    int idx = blockIdx.x * blockDim.x + threadIdx.x;

    unsigned char average = (unsigned char)((DataIn[idx] + DataIn[idx + 1] + DataIn[idx + 2]) / 3);

    DataOut[idx + 0] = average;
    DataOut[idx + 1] = average;
    DataOut[idx + 2] = average;
}
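
Note that with the question's 32-bpp input and 8-bpp output buffers, each thread would also need to stride over one whole input pixel (4 input bytes for each output byte). The same kernel with the indexing adjusted for that layout might look like this (a rough, untested sketch):

__global__ void addMatrix(unsigned char *DataOut, unsigned char *DataIn)
{
    // one thread per output pixel
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;

    int in  = pixel * 4;   // 32-bpp input: blue, green, red, unused fourth byte
    int out = pixel;       // 8-bpp output: one byte per pixel

    DataOut[out] = (unsigned char)((DataIn[in] + DataIn[in + 1] + DataIn[in + 2]) / 3);
}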
  • Thanks for the answer! Especially for the error checking and occupancy. I edited the code and now I have a few small questions. So I have a 3840x2160 image and the limit is 1024 threads. But how many blocks can I use? I've tried *dim3 blocks(5000, 1, 1);* *dim3 threads(1024, 1, 1);* Also, I tried this code in the kernel just to copy the image unchanged, but the output is quite different from what I expected. What's wrong with that? I added the new output image to the post. *int idx = blockIdx.x * blockDim.x + threadIdx.x;* *DataOut[idx] = DataIn[idx];* – Generwp Apr 13 '16 at 14:43
  • (3840*2160)/1024 = 8100 => So I need to use 8100 blocks of 1024 threads each to process a 3840x2160 image, am I right? – Generwp Apr 13 '16 at 14:51
  • Yes! But remember, here it is **exactly** 8100. Code that adapts to the image size would divide by 1024.f instead of 1024 and round the result up, because if the division returns 4.7, casting to int gives 4 blocks, but you actually need 5 blocks to compute every pixel. See this answer on how to round up: http://stackoverflow.com/a/2745086/6172231 **Edit:** You will also need an if statement in the kernel to ensure you don't go out of bounds in your array, because that additional block will have some threads that aren't needed (see the sketch after these comments). – Taro Apr 14 '16 at 09:36
  • Glad I could help ;) – Taro Apr 18 '16 at 08:19
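
Putting the last two comments together, a sketch of the launch configuration with a rounded-up block count and an out-of-bounds guard might look like this (numPixels is an illustrative name, not from the original project; it would correspond to iWidth * iHeight from the question):

// Guarded kernel: the extra threads of the last, partially filled block simply return.
__global__ void addMatrix(unsigned char *DataOut, unsigned char *DataIn, int numPixels)
{
    int pixel = blockIdx.x * blockDim.x + threadIdx.x;
    if (pixel >= numPixels)
        return;
    // ...greyscale conversion as in the answer above...
}

// Host side: round the block count up so every pixel gets a thread.
int numPixels = iWidth * iHeight;                                     // 3840 * 2160 = 8,294,400
int threadsPerBlock = 1024;
int numBlocks = (numPixels + threadsPerBlock - 1) / threadsPerBlock;  // integer ceiling division
addMatrix<<<numBlocks, threadsPerBlock>>>(devDatOut, devDatIn, numPixels);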