Here's a worked example that demonstrates 2 of the 3 concepts discussed in the comments:
The first optimization to consider is that 512 threads is not enough to keep any GPU busy; we'd like to be targeting 10000 threads or more. The GPU is a latency-hiding machine, and when you have too few threads to help it hide latency, your kernel becomes latency-bound, which is a kind of memory-bound problem. The most straightforward way to get there is to have each thread process one pixel in the image (allowing for 512*1024 threads total) rather than one column (which allows for only 512 threads total). However, since this seems to "break" our "first-surface detection" algorithm, we must make another modification as well (item 2 below).
Once all pixels are being processed in parallel, a straightforward adaptation of item 1 above means we no longer know which surface was "first", i.e. which "bright" surface (per column) was closest to row 0. This characteristic of the algorithm changes the problem from a simple transformation to a reduction (one reduction per column of the image, actually). We still allow every pixel to be processed in parallel, with one thread assigned to each pixel, but among the pixels that satisfy the brightness test we keep only the one closest to row zero in each column. A relatively simple way to do this is to use atomicMin on a one-per-column array holding the minimum row (in each column) at which a suitably bright pixel neighborhood has been discovered.
The following code demonstrates the above 2 changes (it does not include any usage of shared memory) and shows, for me on a Tesla K40, about a 1x-20x range of speedup vs. OP's original kernel. The range of speedups is due to the fact that the amount of work each algorithm does varies with image structure, and both algorithms have early-exit strategies. The original kernel can do vastly more or less work, due to the early exit in its for-loops, depending on where (if anywhere) "bright" pixel neighborhoods are discovered in each column. So if all columns have bright neighborhoods near row 0, I see an "improvement" of about 1x (i.e. my kernel runs at about the same speed as the original). If all columns have bright neighborhoods only near the other "end" of the image, I see an improvement of about 20x. This may well vary by GPU, as Kepler GPUs have improved global atomic throughput, which I am using. EDIT: because of this variable work, I've added a crude "early-exit" strategy as a trivial modification to my code. This brings the shortest execution time to approximate parity between the two kernels (i.e. about 1x).
Remaining optimizations might include:
Use of shared memory. This should be a trivial adaptation of the same tile-based shared memory approach that is used, for example, for matrix multiply. If you use a square-ish tile, then you will want to adjust the kernel block/grid dimensions to make those "square-ish" as well (a rough sketch of this idea appears right after this list).
Improved reduction technique. For some image structures, the atomic method may be somewhat slow. This could possibly be improved by switching to a proper parallel reduction per column (a second sketch after this list shows one way that could be structured). You can do a "worst-case" test on my kernel by setting the test image to the threshold value everywhere. This should cause the original kernel to run the fastest and my kernel to run the slowest, but I didn't observe any significant slowdown of my kernel in this case; its execution time is pretty constant. Again, this may be GPU-dependent.
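As a rough illustration of the shared-memory tile idea, here is a minimal sketch. It is not part of the tested code below; it assumes the same SKIP_ROWS/THRESH macros and atomicMin output array used there, skips the first/last columns for brevity (they would need the 6-neighbour variant), and the kernel name and tile sizes are illustrative choices of mine:

#define TILE_X 32   // illustrative tile width
#define TILE_Y 8    // illustrative tile height

__global__ void firstSurfaceDetection_smem (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  // tile with a 1-column halo on each side and a 2-row halo at the bottom
  __shared__ float tile[TILE_Y + 2][TILE_X + 2];
  int col = blockIdx.x * TILE_X + threadIdx.x;
  int row = blockIdx.y * TILE_Y + threadIdx.y;

  // cooperative load of the tile plus halo; out-of-range pixels load as 0
  for (int ty = threadIdx.y; ty < TILE_Y + 2; ty += TILE_Y)
    for (int tx = threadIdx.x; tx < TILE_X + 2; tx += TILE_X) {
      int gr = blockIdx.y * TILE_Y + ty;
      int gc = blockIdx.x * TILE_X + tx - 1;   // shift left by one for the column halo
      tile[ty][tx] = ((gr < height) && (gc >= 0) && (gc < width)) ? threshImg[gr * width + gc] : 0.0f;
    }
  __syncthreads();

  // interior columns only, for brevity; same early-exit test as the atomic kernel
  if ((row >= SKIP_ROWS) && (row < height - 2) && (col > 0) && (col < width - 1) &&
      (row < firstSurfaceImgRow[col])) {
    int cnt = 0;
    for (int dy = 0; dy < 3; dy++)
      for (int dx = 0; dx < 3; dx++)
        if (tile[threadIdx.y + dy][threadIdx.x + dx] == THRESH) cnt++;
    if (cnt == 9) atomicMin(firstSurfaceImgRow + col, row);
  }
}

// launch (illustrative, assuming width/height are multiples of the tile sizes):
//   dim3 block(TILE_X, TILE_Y);
//   dim3 grid(width/TILE_X, height/TILE_Y);
//   firstSurfaceDetection_smem <<< grid, block >>> (threshImg, firstSurfaceImgRow, height, width);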
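And a minimal sketch of the per-column reduction alternative. Again this is illustrative rather than tested: it assumes one 256-thread block per column, handles only interior columns, and uses the value height as the "not found" sentinel (kernel name and block size are my assumptions):

__global__ void firstSurfaceDetection_reduce (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  __shared__ int smin[256];        // one slot per thread; assumes blockDim.x == 256
  int col = blockIdx.x;            // one block handles one column
  int tid = threadIdx.x;
  int best = height;               // "no bright neighborhood found" sentinel

  if ((col > 0) && (col < width - 1)) {   // interior columns only, for brevity
    // each thread strides over the rows of this column
    for (int row = SKIP_ROWS + tid; row < height - 2; row += blockDim.x) {
      int cnt = 0;
      for (int dy = 0; dy < 3; dy++)
        for (int dx = -1; dx <= 1; dx++)
          if (threshImg[(row + dy) * width + (col + dx)] == THRESH) cnt++;
      if (cnt == 9) { best = row; break; }  // rows increase per thread, so the first hit is this thread's minimum
    }
  }
  smin[tid] = best;
  __syncthreads();

  // standard shared-memory tree reduction for the minimum
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) smin[tid] = min(smin[tid], smin[tid + s]);
    __syncthreads();
  }
  if (tid == 0) firstSurfaceImgRow[col] = smin[0];
}

// launch (illustrative): firstSurfaceDetection_reduce <<< width, 256 >>> (threshImg, firstSurfaceImgRow, height, width);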
Example:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define SKIP_ROWS 10
#define THRESH 1.0f
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  // thread index: (0 -> 511)
  if (col < width) {

    if( col == 0 ) { // first col - 0
      for (int row = 0 + rows2skip; row < height; row++) { // skip the first rows2skip rows
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the right
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col+1)];   if(neibs[1] == thresh) { cnt++; } // right
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col+1)]; if(neibs[3] == thresh) { cnt++; } // bottom right
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom right

        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }

    else if ( col == (width-1) ) { // last col
      for (int row = 0 + rows2skip; row < height; row++) {
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the left
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col-1)];   if(neibs[1] == thresh) { cnt++; } // left
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; } // bottom left
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col-1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom left

        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }

    // remaining threads are: (1 -> 510)
    else { // any col other than first or last column
      for (int row = 0 + rows2skip; row < height; row++) {
        int cnt = 0;
        float neibs[9]; // not shared mem as it reduces speed
        // get nine neighbours - three in curr col, three each to left and right
        neibs[0] = threshImg[((row)*width) +(col-1)];   if(neibs[0] == thresh) { cnt++; }
        neibs[1] = threshImg[((row)*width) +(col)];     if(neibs[1] == thresh) { cnt++; }
        neibs[2] = threshImg[((row)*width) +(col+1)];   if(neibs[2] == thresh) { cnt++; }
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; }
        neibs[4] = threshImg[((row+1)*width) +(col)];   if(neibs[4] == thresh) { cnt++; }
        neibs[5] = threshImg[((row+1)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; }
        neibs[6] = threshImg[((row+2)*width) +(col-1)]; if(neibs[6] == thresh) { cnt++; }
        neibs[7] = threshImg[((row+2)*width) +(col)];   if(neibs[7] == thresh) { cnt++; }
        neibs[8] = threshImg[((row+2)*width) +(col+1)]; if(neibs[8] == thresh) { cnt++; }

        if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
  }

  __syncthreads();
}
__global__ void firstSurfaceDetection_opt (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int row = threadIdx.y + (blockDim.y*blockIdx.y);

  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  // early-exit: skip any row at or below the current per-column minimum
  if ((row >= rows2skip) && (row < height-2) && (col < width) && (row < firstSurfaceImgRow[col])) {
    int cnt = 0;
    int inc = 0;
    if (col == 0) inc = +1;
    if (col == (width-1)) inc = -1;
    if (inc){
      // edge columns have only six neighbours; start cnt at 3 so the same cnt == 9 test applies
      cnt = 3;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+inc)] == thresh) cnt++;
    }
    else {
      // get nine neighbours - three in curr col, three each to left and right
      if (threshImg[((row)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+1)] == thresh) cnt++;
    }
    if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
      // keep the smallest qualifying row index per column
      atomicMin(firstSurfaceImgRow + col, row);
    }
  }
}
int main(int argc, char *argv[]){

  float *threshImg, *h_threshImg, *firstSurfaceImg, *h_firstSurfaceImg;
  int *firstSurfaceImgRow, *h_firstSurfaceImgRow;
  int actualImHeight = 1024;
  int actualImWidth = 512;
  int row_set = 512;
  if (argc > 1){
    int my_val = atoi(argv[1]);
    if ((my_val > SKIP_ROWS) && (my_val < actualImHeight - 3)) row_set = my_val;
  }

  h_firstSurfaceImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_threshImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_firstSurfaceImgRow = (int *)malloc(actualImWidth*sizeof(int));
  cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImgRow, actualImWidth*sizeof(int));
  // setting every byte to 1 gives 0x01010101 per int, which is larger than any valid
  // row index and so serves as the "no surface found yet" sentinel for atomicMin
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
  cudaMemset(firstSurfaceImg, 0, actualImHeight*actualImWidth*sizeof(float));

  for (int i = 0; i < actualImHeight*actualImWidth; i++) h_threshImg[i] = 0.0f;
  // insert "bright row" here
  for (int i = (row_set*actualImWidth); i < ((row_set+3)*actualImWidth); i++) h_threshImg[i] = THRESH;
  cudaMemcpy(threshImg, h_threshImg, actualImHeight*actualImWidth*sizeof(float), cudaMemcpyHostToDevice);

  dim3 grid(1,1024);  // 512-thread blocks (x) * 1024 grid rows (y) -> one thread per pixel
  // warm-up run
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));  // reset the sentinel after the warm-up run
  cudaDeviceSynchronize();

  unsigned long long t2 = dtime_usec(0);
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t2 = dtime_usec(t2);
  cudaMemcpy(h_firstSurfaceImgRow, firstSurfaceImgRow, actualImWidth*sizeof(int), cudaMemcpyDeviceToHost);

  unsigned long long t1 = dtime_usec(0);
  firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  cudaMemcpy(h_firstSurfaceImg, firstSurfaceImg, actualImWidth*actualImHeight*sizeof(float), cudaMemcpyDeviceToHost);

  printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);

  // validate results
  for (int i = 0; i < actualImWidth; i++)
    if (h_firstSurfaceImgRow[i] < actualImHeight)
      if (h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i] != THRESH)
        {printf("mismatch at %d, was %f, should be %f\n", i, h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i], THRESH); return 1;}
  return 0;
}
$ nvcc -arch=sm_35 -o t667 t667.cu
$ ./t667
t1 = 0.000978s, t2 = 0.000050s
$
Notes:
The above example inserts a "bright neighborhood" all the way across the image at row=512, thus giving a middle-of-the-road speedup factor of almost 20x in my case (K40c).
For brevity of presentation, I have dispensed with proper CUDA error checking; I recommend it, however (a minimal sketch of one approach appears after these notes).
The execution time of either kernel depends quite a bit on whether it is the first kernel run in the program, probably due to caching and general warm-up effects. Therefore, to give sane results, I've run my kernel first as an extra, untimed warm-up run.
One of the reasons I haven't pursued a shared-memory optimization is that this problem is already pretty small, and at least for a big Kepler GPU like the K40, it will fit almost entirely in L2 cache (especially with my kernel, since it uses a smaller output data structure). Given that, shared memory may not give much of a perf boost.
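As a rough illustration of the error checking mentioned above, here is a minimal sketch; the macro name cudaCheck is my own choice, not something the code above uses:

// minimal error-checking helper - illustrative, not part of the timed code above
#define cudaCheck(call) do {                                            \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                      \
              cudaGetErrorString(err), __FILE__, __LINE__);             \
      exit(1);                                                          \
    }                                                                   \
  } while (0)

// usage would look like:
//   cudaCheck(cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float)));
//   firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
//   cudaCheck(cudaGetLastError());        // catch launch-time errors
//   cudaCheck(cudaDeviceSynchronize());   // catch asynchronous execution errors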
EDIT: I've modified the code (again) so that the line (row) in the test image where the bright boundary is inserted can be passed as a command-line parameter, and I have tested the code on 3 different devices, using 3 different settings for the bright row:
execution time on:     K40        C2075      Quadro NVS 310
bright row =   15:     31/33      23/45      29/314
bright row =  512:     978/50     929/112    1232/805
bright row = 1000:     2031/71    2033/149   2418/1285

all times are in microseconds (original kernel / optimized kernel)
CUDA 6.5, CentOS 6.2