
I am trying to optimize the computation of the mean of each row of my 512 (w) x 1024 (h) image, and then subtract that mean from the row it was computed from. I wrote a kernel that does this in 1.86 ms, but I want to reduce that time. The code works fine, but it does not use shared memory and it relies on for loops; I want to do away with them.

__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {

  // height = 1024, width = 512

  int tidy = threadIdx.x + blockDim.x * blockIdx.x; 

  float sum = 0.0f; 
  float sumDiv = 0.0f; 

  if(tidy < height) { 

      for(int c = 0; c < width; c++) { 

          sum += img[tidy*width + c];
      }
      sumDiv = (sum/width)/2;

      //__syncthreads(); 

      for(int cc = 0; cc < width; cc++) { 

          lineImg[tidy*width + cc] = img[tidy*width + cc] - sumDiv;
      }

  }

  __syncthreads(); 
}

I called the above kernel using:

subtractMean <<< 2, 512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);

The following code I wrote uses shared memory to optimize, but it does not work as expected. Any thoughts on what the problem might be?

__global__ void subtractMean (const float *__restrict__ img, float *lineImg, int height, int width) {

  extern __shared__ float perRow[];

  int idx = threadIdx.x;    // set idx along x
  int stride = width/2; 

  while(idx < width) { 
      perRow[idx] = 0; 
      idx += stride; 
  }

  __syncthreads(); 

  int tidx = threadIdx.x;   // set idx along x
  int tidy = blockIdx.x;    // set idx along y

  if(tidy < height) { 
      while(tidx < width) { 
          perRow[tidx] = img[tidy*width + tidx];
          tidx += stride; 
      }
  }

  __syncthreads(); 

  tidx = threadIdx.x;   // reset idx along x
  tidy = blockIdx.x;    // reset idx along y

  if(tidy < height) { 

      float sumAllPixelsInRow = 0.0f; 
      float sumDiv = 0.0f; 

      while(tidx < width) { 
          sumAllPixelsInRow += perRow[tidx];
          tidx += stride;
      }
      sumDiv = (sumAllPixelsInRow/width)/2;

      tidx = threadIdx.x;   // reset idx along x

      while(tidx < width) { 

          lineImg[tidy*width + tidx] = img[tidy*width + tidx] - sumDiv; 
          tidx += stride;
      }
  }

  __syncthreads();  
}

The shared memory function was called using:

subtractMean <<< 1024, 256, sizeof(float)*512 >>> (originalImage, rowMajorImage, actualImHeight, actualImWidth);
Eagle
  • Your shared memory kernel looks incoherent to me. So I'm not sure I could use that as a starting point to do anything sensible. A sensible way to approach this problem would be to do a shared memory parallel reduction per row (there are many examples of that on SO, or you can google "cuda parallel reduction", or there is a CUDA sample code for it), to compute the sum. Once you have the sum per row, you can compute the average per row. Then broadcast this average to every thread handling that row (via shared memory), and subtract. You will want to assign one threadblock per row. – Robert Crovella Feb 17 '15 at 01:46
  • @RobertCrovella I have assigned one threadblock per row; 1024 rows == 1024 thread blocks, with 256 threads per block. I stride through the 512-element row, 256 elements at a time. So, I can't imagine where I am going wrong? – Eagle Feb 17 '15 at 01:49
  • @AMostMajestuousCapybara `sm_30`, Quadro K6000 – Eagle Feb 17 '15 at 01:53
  • A parallel reduction doesn't involve striding through a row with a fixed stride. Rather than trying to imagine where you are going wrong, I would suggest learning how to write a classical parallel reduction. (As just one example of brokenness, you don't seem to realize that `sumAllPixelsInRow` is a unique i.e. separate local variable *per thread*). You don't have a parallel reduction, and your code bears little resemblance to one. [Here's an SO question](http://stackoverflow.com/questions/17862078/cuda-code-for-sum-of-rows-of-a-matrix-too-slow) that is certainly related. – Robert Crovella Feb 17 '15 at 02:12
  • [Here](http://docs.nvidia.com/cuda/samples/6_Advanced/reduction/doc/reduction.pdf) is the canonical tutorial on CUDA parallel reduction. – Robert Crovella Feb 17 '15 at 02:16
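For reference, the classical shared-memory parallel reduction that these comments describe might be sketched like this (a hypothetical, untested kernel: one block per row, a power-of-two block size of 256 assumed, and the plain row mean subtracted rather than the halved mean from the question):

```cuda
#define BLOCK 256  // assumed power-of-two block size

// One block per row; launch as subtractMeanReduce<<<height, BLOCK>>>(...)
__global__ void subtractMeanReduce(const float * __restrict__ img,
                                   float *lineImg, int height, int width)
{
    __shared__ float sdata[BLOCK];

    int row = blockIdx.x;
    int tid = threadIdx.x;

    // Each thread accumulates a strided partial sum over its row
    float partial = 0.0f;
    for (int c = tid; c < width; c += blockDim.x)
        partial += img[row * width + c];
    sdata[tid] = partial;
    __syncthreads();

    // Tree reduction in shared memory: halve the active threads each step
    for (int s = blockDim.x / 2; s > 0; s >>= 1) {
        if (tid < s) sdata[tid] += sdata[tid + s];
        __syncthreads();
    }

    // sdata[0] now holds the full row sum, visible to every thread
    float mean = sdata[0] / (float)width;
    for (int c = tid; c < width; c += blockDim.x)
        lineImg[row * width + c] = img[row * width + c] - mean;
}
```

Unlike the kernel in the question, the partial sums here are explicitly combined in shared memory before any thread uses the total; the per-thread `sumAllPixelsInRow` in the question is never combined, which is the core bug the comments point at.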

1 Answer


2 blocks is hardly enough to saturate the GPU. You are going in the right direction by using more blocks; however, since you are on Kepler, I would like to present an option that does not use shared memory at all.

Start with 32 threads in a block (this can be changed later using 2D blocks). With those 32 threads you should do something along these lines:

int rowID = blockIdx.x;
int index = threadIdx.x;
float sum = 0.0f;
while (index < width) {
    sum += img[width * rowID + index];
    index += blockDim.x;
}

At this point you will have 32 threads, each holding a partial sum. You next need to add them all together. You can do this without shared memory (since we are within a warp) by using a shuffle reduction. For details, look here: http://devblogs.nvidia.com/parallelforall/faster-parallel-reductions-kepler/ — what you want is the warp shuffle reduce, but you need to change it to use the full 32 threads.
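Widened to all 32 lanes, the shuffle reduction from that blog post might look like this (a sketch; `__shfl_down` is the pre-CUDA-9 intrinsic, later renamed `__shfl_down_sync`):

```cuda
// Full-warp shuffle reduction: after the loop, lane 0 holds the
// sum of `val` across all 32 lanes of the warp.
__inline__ __device__ float warpReduceSum(float val)
{
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        val += __shfl_down(val, offset);  // pre-CUDA-9 intrinsic
    return val;
}
```

Each iteration folds the upper half of the active lanes onto the lower half (offsets 16, 8, 4, 2, 1), so five steps cover a full warp.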

Now that thread 0 in each warp has the sum of its row, you can divide that by the width cast to a float, and broadcast it to the rest of the warp using `__shfl(average, 0)`. http://docs.nvidia.com/cuda/cuda-c-programming-guide/#warp-description

With the average found and the warp synchronized (implicitly by lockstep execution, and explicitly by the shfl), you can continue in a similar fashion with the subtraction.
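Putting these pieces together, the whole approach might be sketched as follows (untested; assumes one 32-thread block per row and the pre-CUDA-9 `__shfl` intrinsics):

```cuda
// One 32-thread block (one warp) per row.
// Launch: subtractRowMean<<<height, 32>>>(img, lineImg, width);
__global__ void subtractRowMean(const float * __restrict__ img,
                                float *lineImg, int width)
{
    int rowID = blockIdx.x;
    int tid   = threadIdx.x;  // 0..31, a single warp

    // Strided partial sum over this row
    float sum = 0.0f;
    for (int c = tid; c < width; c += blockDim.x)
        sum += img[rowID * width + c];

    // Warp shuffle reduction: lane 0 ends up with the row sum
    for (int offset = warpSize / 2; offset > 0; offset /= 2)
        sum += __shfl_down(sum, offset);

    // Broadcast lane 0's sum to the whole warp, then form the mean
    float average = __shfl(sum, 0) / (float)width;

    // Subtract the row mean from every pixel in the row
    for (int c = tid; c < width; c += blockDim.x)
        lineImg[rowID * width + c] = img[rowID * width + c] - average;
}
```

Note this subtracts the plain row mean; the question's original kernel divides the mean by 2 (`sumDiv = (sum/width)/2`), so adjust accordingly if that halving was intentional.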

Possible further optimizations would be to include more than one warp in a block to improve occupancy, and to manually unroll the loops over the width to improve instruction level parallelism.

Good Luck.

Christian Sarofeen