Paralleling a simple algorithm on GPU with CUDA

Question

I have a CUDA function that calculates Local Binary Patterns on GPU. Basically LBP is a computation over the pixels of an image where the value of any given pixel (i,j) depends on it's 8 neighbors' intensities.

So far so good, the code is the following:

//The kernel
__global__ void LBP(unsigned char *in, unsigned char *out, const int w, const int h)
{
    const unsigned int i = (blockIdx.x * blockDim.x) + threadIdx.x;

    //Don't do edges!
    if(
             i < w              //first row
        ||   i >= (w * (h - 1)) // last row
        || !(i % w)             // first column
        ||  (i % w + 1 == w)    // last column
    )
    {
        out[i] = 0;
        return;
    }

    unsigned char
        code = 0,
        center = in[i];

    code |= (in[i-w-1] > center) << 7;
    code |= (in[i-w  ] > center) << 6;
    code |= (in[i-w+1] > center) << 5;
    code |= (in[i  +1] > center) << 4;
    code |= (in[i+w+1] > center) << 3;
    code |= (in[i+w  ] > center) << 2;
    code |= (in[i+w-1] > center) << 1;
    code |= (in[i  -1] > center) << 0;

    out[i] = code;
}

// A proxi function
void DoLBP(unsigned char *in, unsigned char *out, const int w, const int h)
{
    const int
        sz = w * h * sizeof(unsigned char);
    unsigned char
        *in_gpu,
        *out_gpu;

    cudaMalloc((void**)&in_gpu,  sz);
    cudaMalloc((void**)&out_gpu, sz);

    cudaMemcpy(in_gpu,  in,  sz, cudaMemcpyHostToDevice);
    cudaMemcpy(out_gpu, out, sz, cudaMemcpyHostToDevice);

    dim3 threadsPerBlock(1024); //Max
    dim3 numBlocks(w*h/threadsPerBlock.x + 1);
    LBP<<<numBlocks,threadsPerBlock>>>(in_gpu, out_gpu, w, h);

    cudaMemcpy(out, out_gpu, sz, cudaMemcpyDeviceToHost);

    cudaFree(in_gpu);
    cudaFree(out_gpu);
}

//The caller
int main()
{
    printf("Starting\n");

    const int
        w = 4000,
        h = 2000;
    unsigned char
        in[w*h],
        out[w*h];

    // Fill [in] with some data 

    DoLBP(in, out, w, h);


    // Use [out] data

    return 0;
}

The images are passed to the GPU as a single-dimension array of *unsigned char*s (array = [[row 1] [row 2] [row 3] ... [row n]]) (they are extracted from OpenCV's Mat)

The problem

This code works fine with relatively small images and it returns the output array populated with the right values but when the image size grows, the output array is all zeroed!

My suspicion is that the image data is overflowing some GPU buffer or something like that.

It is also not clear to me how does the numberOfBlocks and threadsPerBlock part work! If any of you could provide some basic insight about this it would be much appreciated.

(I'm like 1-day-old in CUDA, so there might be way too many ways to improve this snippet of code!)

score 5 · Accepted Answer · edited May 23 '17 at 10:28

I would suggest adding proper cuda error checking to your code. I believe your kernel is making out-of-bounds accesses and failing.
run your code with cuda-memcheck, as it will help identify why the kernel is failing.
These are fairly large allocations to make on the stack:
```
const int
  w = 4000,
  h = 2000;
unsigned char
  in[w*h],
  out[w*h];
```
roughly 8MB each. That can be a problem; it may be system-dependent. It's usually better to do large allocations via dynamic allocation e.g. malloc. On my particular system, I get a seg fault due to these large stack variables not being allocated correctly.
Your kernel is missing an appropriate "thread check". At first I thought you were doing a good job with this:
```
if(
     i < w              //first row
  ||   i >= (w * (h - 1)) // last row
  || !(i % w)             // first column
  ||  (i % w + 1 == w)    // last column
)
```
but this is a problem:
```
out[i] = 0;
return;
```
If you comment out the out[i] = 0; line, you'll have better luck. Alternatively, if you don't like commenting it out, you could do:
```
if (i < (w*h)) out[i] = 0;
```
The problem is your grid launch parameters necessarily create "extra threads":
```
dim3 threadsPerBlock(1024); //Max
dim3 numBlocks(w*h/threadsPerBlock.x + 1);
```
If you have a proper thread check (which you almost do...), then it's not a problem. But you can't let those extra threads write to invalid locations.

To explain thread per block and number of blocks, working through the arithmetic may be useful. A cuda kernel launch has an associated grid. The grid is simply all the threads associated with a kernel launch. The threads will be divided into blocks. So the grid is equal to the number of blocks launched times the threads per block. How many is that in your case? This line says you are asking for 1024 threads per block:

    dim3 threadsPerBlock(1024); //Max

The number of blocks you are launching is given by:

    dim3 numBlocks(w*h/threadsPerBlock.x + 1);

The arithmetic is:

    (w=4000)*(h=2000)/1024 = 7812.5 = 7812   (note this is an *integer* divide)

Then we add 1. So you are launching 7813 blocks. How many threads is that?

    (7813 blocks)*(1024 threads per block) = 8000512 threads

But you only need (and only want) 8000000 threads (= w * h) So you need a thread check to prevent the extra 512 threads from trying to access out[i]. But your thread check is broken in this respect.

As a final note, the most obvious way to me to make this code run faster would be to exploit the data-reuse in adjacent operations via shared memory. But get your code working correctly first.

On number 3: The arrays are allocated dynamically, not on the stack on the actual implementation, that's basically a naive example, although you made a good observation there! The answer looks very informative! I'll try your suggestions and feedback to you :) — ButterDog, Mar 07 '14 at 22:44

Paralleling a simple algorithm on GPU with CUDA

1 Answers1