Here's a worked example that demonstrates 2 of the 3 concepts discussed in the comments:
The first optimization to consider is that 512 threads is not enough to keep any GPU busy; we'd like to be targeting 10000 threads or more. The GPU is a latency-hiding machine, and when you have too few threads to help it hide latency, your kernel becomes latency-bound, which is a kind of memory-bound problem. The most straightforward way to get there is to have each thread process one pixel in the image (allowing for 512*1024 threads total) rather than one column (which allows for only 512 threads total). However, since this seems to "break" our "first-surface detection" algorithm, we must make another modification as well (item 2 below).
Once all pixels are being processed in parallel, a straightforward adaptation of item 1 above means we no longer know which surface was "first", i.e. which "bright" surface (per column) was closest to row 0. This characteristic of the algorithm changes the problem from a simple transformation to a reduction (one reduction per column of the image, actually). We still allow every pixel to be processed in parallel, with one thread assigned to each pixel, but among the pixels that satisfy the brightness test we keep only the one closest to row zero in each column. A relatively simple way to do this is to use atomicMin on a one-per-column array holding the minimum row (in each column) at which a suitably bright pixel neighborhood has been discovered.
The following code demonstrates the above 2 changes (it does not include any usage of shared memory) and shows, for me on a Tesla K40, about a 1x-20x range of speedup vs. OP's original kernel. The range of speedups is due to the fact that the amount of work each algorithm does varies with image structure, and both algorithms have early-exit strategies. The original kernel can do vastly more or less work, due to the early exit in its for-loops, depending on where (if anywhere) "bright" pixel neighborhoods are discovered in each column. So if all columns have bright neighborhoods near row 0, I see an "improvement" of about 1x (i.e. my kernel runs at about the same speed as the original). If all columns have bright neighborhoods only near the other "end" of the image, I see an improvement of about 20x. This may well vary by GPU, as Kepler GPUs have improved global atomic throughput, which I am using. EDIT: because of this variable work, I've added a crude "early-exit" strategy as a trivial modification to my code. This brings the shortest execution time to approximate parity between the two kernels (i.e. about 1x).
Remaining optimizations might include:
Use of shared memory. This should be a trivial adaptation of the same tile-based shared memory approach that is used, for example, for matrix multiply. If you use a square-ish tile, then you will want to adjust the kernel block/grid dimensions to make those "square-ish" as well (a rough sketch of this idea appears right after this list).
Improved reduction technique. For some image structures, the atomic method may be somewhat slow. This could possibly be improved by switching to a proper parallel reduction per column (a second sketch after this list shows one way that could be structured). You can do a "worst-case" test on my kernel by setting the test image to the threshold value everywhere. This should cause the original kernel to run the fastest and my kernel to run the slowest, but I didn't observe any significant slowdown of my kernel in this case; its execution time is pretty constant. Again, this may be GPU-dependent.
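As a rough illustration of the shared-memory tile idea, here is a minimal sketch. It is not part of the tested code below; it assumes the same SKIP_ROWS/THRESH macros and atomicMin output array used there, skips the first/last columns for brevity (they would need the 6-neighbour variant), and the kernel name and tile sizes are illustrative choices of mine:

#define TILE_X 32   // illustrative tile width
#define TILE_Y 8    // illustrative tile height

__global__ void firstSurfaceDetection_smem (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  // tile with a 1-column halo on each side and a 2-row halo at the bottom
  __shared__ float tile[TILE_Y + 2][TILE_X + 2];
  int col = blockIdx.x * TILE_X + threadIdx.x;
  int row = blockIdx.y * TILE_Y + threadIdx.y;

  // cooperative load of the tile plus halo; out-of-range pixels load as 0
  for (int ty = threadIdx.y; ty < TILE_Y + 2; ty += TILE_Y)
    for (int tx = threadIdx.x; tx < TILE_X + 2; tx += TILE_X) {
      int gr = blockIdx.y * TILE_Y + ty;
      int gc = blockIdx.x * TILE_X + tx - 1;   // shift left by one for the column halo
      tile[ty][tx] = ((gr < height) && (gc >= 0) && (gc < width)) ? threshImg[gr * width + gc] : 0.0f;
    }
  __syncthreads();

  // interior columns only, for brevity; same early-exit test as the atomic kernel
  if ((row >= SKIP_ROWS) && (row < height - 2) && (col > 0) && (col < width - 1) &&
      (row < firstSurfaceImgRow[col])) {
    int cnt = 0;
    for (int dy = 0; dy < 3; dy++)
      for (int dx = 0; dx < 3; dx++)
        if (tile[threadIdx.y + dy][threadIdx.x + dx] == THRESH) cnt++;
    if (cnt == 9) atomicMin(firstSurfaceImgRow + col, row);
  }
}

// launch (illustrative, assuming width/height are multiples of the tile sizes):
//   dim3 block(TILE_X, TILE_Y);
//   dim3 grid(width/TILE_X, height/TILE_Y);
//   firstSurfaceDetection_smem <<< grid, block >>> (threshImg, firstSurfaceImgRow, height, width);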
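And a minimal sketch of the per-column reduction alternative. Again this is illustrative rather than tested: it assumes one 256-thread block per column, handles only interior columns, and uses the value height as the "not found" sentinel (kernel name and block size are my assumptions):

__global__ void firstSurfaceDetection_reduce (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  __shared__ int smin[256];        // one slot per thread; assumes blockDim.x == 256
  int col = blockIdx.x;            // one block handles one column
  int tid = threadIdx.x;
  int best = height;               // "no bright neighborhood found" sentinel

  if ((col > 0) && (col < width - 1)) {   // interior columns only, for brevity
    // each thread strides over the rows of this column
    for (int row = SKIP_ROWS + tid; row < height - 2; row += blockDim.x) {
      int cnt = 0;
      for (int dy = 0; dy < 3; dy++)
        for (int dx = -1; dx <= 1; dx++)
          if (threshImg[(row + dy) * width + (col + dx)] == THRESH) cnt++;
      if (cnt == 9) { best = row; break; }  // rows increase per thread, so the first hit is this thread's minimum
    }
  }
  smin[tid] = best;
  __syncthreads();

  // standard shared-memory tree reduction for the minimum
  for (int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) smin[tid] = min(smin[tid], smin[tid + s]);
    __syncthreads();
  }
  if (tid == 0) firstSurfaceImgRow[col] = smin[0];
}

// launch (illustrative): firstSurfaceDetection_reduce <<< width, 256 >>> (threshImg, firstSurfaceImgRow, height, width);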
Example:
#include <stdlib.h>
#include <stdio.h>
#include <time.h>
#include <sys/time.h>

#define SKIP_ROWS 10
#define THRESH 1.0f
#define USECPSEC 1000000ULL
unsigned long long dtime_usec(unsigned long long start){

  timeval tv;
  gettimeofday(&tv, 0);
  return ((tv.tv_sec*USECPSEC)+tv.tv_usec)-start;
}
__global__ void firstSurfaceDetection (float *threshImg, float *firstSurfaceImg, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  // thread index: (0 -> 511)
  if (col < width) {

    if( col == 0 ) { // first col - 0
      for (int row = 0 + rows2skip; row < height; row++) { // skip the first rows2skip rows
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the right
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col+1)];   if(neibs[1] == thresh) { cnt++; } // right
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col+1)]; if(neibs[3] == thresh) { cnt++; } // bottom right
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom right

        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }

    else if ( col == (width-1) ) { // last col
      for (int row = 0 + rows2skip; row < height; row++) {
        int cnt = 0;
        float neibs[6]; // not shared mem as it reduces speed
        // get six neighbours - three in same col, and three to the left
        neibs[0] = threshImg[((row)*width) +(col)];     if(neibs[0] == thresh) { cnt++; } // current position
        neibs[1] = threshImg[((row)*width) +(col-1)];   if(neibs[1] == thresh) { cnt++; } // left
        neibs[2] = threshImg[((row+1)*width) +(col)];   if(neibs[2] == thresh) { cnt++; } // bottom
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; } // bottom left
        neibs[4] = threshImg[((row+2)*width) +(col)];   if(neibs[4] == thresh) { cnt++; } // curr offset by 2 - bottom
        neibs[5] = threshImg[((row+2)*width) +(col-1)]; if(neibs[5] == thresh) { cnt++; } // curr offset by 2 - bottom left

        if(cnt == 6) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }

    // remaining threads are: (1 -> 510)
    else { // any col other than first or last column
      for (int row = 0 + rows2skip; row < height; row++) {
        int cnt = 0;
        float neibs[9]; // not shared mem as it reduces speed
        // get nine neighbours - three in curr col, three each to left and right
        neibs[0] = threshImg[((row)*width) +(col-1)];   if(neibs[0] == thresh) { cnt++; }
        neibs[1] = threshImg[((row)*width) +(col)];     if(neibs[1] == thresh) { cnt++; }
        neibs[2] = threshImg[((row)*width) +(col+1)];   if(neibs[2] == thresh) { cnt++; }
        neibs[3] = threshImg[((row+1)*width) +(col-1)]; if(neibs[3] == thresh) { cnt++; }
        neibs[4] = threshImg[((row+1)*width) +(col)];   if(neibs[4] == thresh) { cnt++; }
        neibs[5] = threshImg[((row+1)*width) +(col+1)]; if(neibs[5] == thresh) { cnt++; }
        neibs[6] = threshImg[((row+2)*width) +(col-1)]; if(neibs[6] == thresh) { cnt++; }
        neibs[7] = threshImg[((row+2)*width) +(col)];   if(neibs[7] == thresh) { cnt++; }
        neibs[8] = threshImg[((row+2)*width) +(col+1)]; if(neibs[8] == thresh) { cnt++; }

        if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
          firstSurfaceImg[(row)*width + col] = 1.0f;
          row = height;
        }
      }
    }
  }

  __syncthreads();
}
__global__ void firstSurfaceDetection_opt (const float * __restrict__ threshImg, int *firstSurfaceImgRow, int height, int width) {

  int col = threadIdx.x + (blockDim.x*blockIdx.x);
  int row = threadIdx.y + (blockDim.y*blockIdx.y);

  int rows2skip = SKIP_ROWS;
  float thresh = THRESH;

  // early-exit: skip any row at or below the current per-column minimum
  if ((row >= rows2skip) && (row < height-2) && (col < width) && (row < firstSurfaceImgRow[col])) {
    int cnt = 0;
    int inc = 0;
    if (col == 0) inc = +1;
    if (col == (width-1)) inc = -1;
    if (inc){
      // edge columns have only six neighbours; start cnt at 3 so the same cnt == 9 test applies
      cnt = 3;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+inc)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+inc)] == thresh) cnt++;
    }
    else {
      // get nine neighbours - three in curr col, three each to left and right
      if (threshImg[((row)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+1)*width) +(col+1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col-1)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col)] == thresh) cnt++;
      if (threshImg[((row+2)*width) +(col+1)] == thresh) cnt++;
    }
    if(cnt == 9) { // if all neighbours are bright, we are at the edge boundary
      // keep the smallest qualifying row index per column
      atomicMin(firstSurfaceImgRow + col, row);
    }
  }
}
int main(int argc, char *argv[]){

  float *threshImg, *h_threshImg, *firstSurfaceImg, *h_firstSurfaceImg;
  int *firstSurfaceImgRow, *h_firstSurfaceImgRow;
  int actualImHeight = 1024;
  int actualImWidth = 512;
  int row_set = 512;
  if (argc > 1){
    int my_val = atoi(argv[1]);
    if ((my_val > SKIP_ROWS) && (my_val < actualImHeight - 3)) row_set = my_val;
  }

  h_firstSurfaceImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_threshImg = (float *)malloc(actualImHeight*actualImWidth*sizeof(float));
  h_firstSurfaceImgRow = (int *)malloc(actualImWidth*sizeof(int));
  cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImg, actualImHeight*actualImWidth*sizeof(float));
  cudaMalloc(&firstSurfaceImgRow, actualImWidth*sizeof(int));
  // setting every byte to 1 gives 0x01010101 per int, which is larger than any valid
  // row index and so serves as the "no surface found yet" sentinel for atomicMin
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));
  cudaMemset(firstSurfaceImg, 0, actualImHeight*actualImWidth*sizeof(float));

  for (int i = 0; i < actualImHeight*actualImWidth; i++) h_threshImg[i] = 0.0f;
  // insert "bright row" here
  for (int i = (row_set*actualImWidth); i < ((row_set+3)*actualImWidth); i++) h_threshImg[i] = THRESH;
  cudaMemcpy(threshImg, h_threshImg, actualImHeight*actualImWidth*sizeof(float), cudaMemcpyHostToDevice);

  dim3 grid(1,1024);  // 512-thread blocks (x) * 1024 grid rows (y) -> one thread per pixel
  // warm-up run
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  cudaMemset(firstSurfaceImgRow, 1, actualImWidth*sizeof(int));  // reset the sentinel after the warm-up run
  cudaDeviceSynchronize();

  unsigned long long t2 = dtime_usec(0);
  firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t2 = dtime_usec(t2);
  cudaMemcpy(h_firstSurfaceImgRow, firstSurfaceImgRow, actualImWidth*sizeof(int), cudaMemcpyDeviceToHost);

  unsigned long long t1 = dtime_usec(0);
  firstSurfaceDetection <<< 1, 512 >>> (threshImg, firstSurfaceImg, actualImHeight, actualImWidth);
  cudaDeviceSynchronize();
  t1 = dtime_usec(t1);
  cudaMemcpy(h_firstSurfaceImg, firstSurfaceImg, actualImWidth*actualImHeight*sizeof(float), cudaMemcpyDeviceToHost);

  printf("t1 = %fs, t2 = %fs\n", t1/(float)USECPSEC, t2/(float)USECPSEC);

  // validate results
  for (int i = 0; i < actualImWidth; i++)
    if (h_firstSurfaceImgRow[i] < actualImHeight)
      if (h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i] != THRESH)
        {printf("mismatch at %d, was %f, should be %f\n", i, h_firstSurfaceImg[(h_firstSurfaceImgRow[i]*actualImWidth)+i], THRESH); return 1;}
  return 0;
}
$ nvcc -arch=sm_35 -o t667 t667.cu
$ ./t667
t1 = 0.000978s, t2 = 0.000050s
$
Notes:
The above example inserts a "bright neighborhood" all the way across the image at row=512, thus giving a middle-of-the-road speedup factor of almost 20x in my case (K40c).
For brevity of presentation, I have dispensed with proper CUDA error checking; I recommend it, however (a minimal sketch of one approach appears after these notes).
The execution time of either kernel depends quite a bit on whether it is the first kernel run in the program, probably due to caching and general warm-up effects. Therefore, to give sane results, I've run my kernel first as an extra, untimed warm-up run.
One of the reasons I haven't pursued a shared-memory optimization is that this problem is already pretty small, and at least for a big Kepler GPU like the K40, it will fit almost entirely in L2 cache (especially with my kernel, since it uses a smaller output data structure). Given that, shared memory may not give much of a perf boost.
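As a rough illustration of the error checking mentioned above, here is a minimal sketch; the macro name cudaCheck is my own choice, not something the code above uses:

// minimal error-checking helper - illustrative, not part of the timed code above
#define cudaCheck(call) do {                                            \
    cudaError_t err = (call);                                           \
    if (err != cudaSuccess) {                                           \
      fprintf(stderr, "CUDA error: %s at %s:%d\n",                      \
              cudaGetErrorString(err), __FILE__, __LINE__);             \
      exit(1);                                                          \
    }                                                                   \
  } while (0)

// usage would look like:
//   cudaCheck(cudaMalloc(&threshImg, actualImHeight*actualImWidth*sizeof(float)));
//   firstSurfaceDetection_opt <<< grid, 512 >>> (threshImg, firstSurfaceImgRow, actualImHeight, actualImWidth);
//   cudaCheck(cudaGetLastError());        // catch launch-time errors
//   cudaCheck(cudaDeviceSynchronize());   // catch asynchronous execution errors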
EDIT: I've modified the code (again) so that the line (row) in the test image where the bright boundary is inserted can be passed as a command-line parameter, and I have tested the code on 3 different devices, using 3 different settings for the bright row:
execution time on:     K40        C2075      Quadro NVS 310
bright row =   15:     31/33      23/45      29/314
bright row =  512:     978/50     929/112    1232/805
bright row = 1000:     2031/71    2033/149   2418/1285

all times are in microseconds (original kernel / optimized kernel)
CUDA 6.5, CentOS 6.2