3

I need to compute the median of an array of size p inside a CUDA kernel (in my case, p is small e.g. p = 10). I am using an O(p^2) algorithm for its simplicity, but at the cost of time performance.

Is there a "function" to find the median efficiently that I can call inside a CUDA kernel?

I know I could implement a selection algorithm, but I'm looking for a function and/or tested code.

Thanks!

rodms
  • 371
  • 3
  • 12
  • 3
    Given the small value of `p`, have you considered using a templated function that employs a minimal sorting network? – njuffa Aug 30 '13 at 18:10
  • 1
    The small value of `p` probably indicates you should write your own code, as others have already suggested. If you want to look at some basic sample codes, don't forget to look at the various sorting codes in the [cuda samples](http://docs.nvidia.com/cuda/cuda-samples/index.html) as well as [CUB](http://nvlabs.github.io/cub/). – Robert Crovella Aug 30 '13 at 19:01
  • Try implementing http://en.wikipedia.org/wiki/Median_of_medians – Pavan Yalamanchili Aug 31 '13 at 05:26
  • 1
    In the first Graphics Gems book, Paeth had a nice implementation of 3x3 median that discards mins and maxes until you're left with the median. Paeth, Alan W., Median Finding on a 3-by-3 Grid, p. 171-175, code: p. 711-712. C code here: http://cudahandbook.to/1dCFCOQ – ArchaeaSoftware Aug 31 '13 at 22:43
  • @rodms could you able compute median of array in cuda? if yes please give the way you did. – Jony Oct 29 '13 at 12:53
  • @AshishDonvir Not as efficiently as I would like, though I haven't tried implementing a selection algorithm. – rodms Oct 29 '13 at 16:56
  • @rodms could you look at my [this](http://stackoverflow.com/questions/19634328/cuda-make-more-faster-and-optimized-median-filter-for-2d-image) question about median filter and suggest me somthing to make it more efficient? can i have better way to assign neighbour pixel in that? – Jony Oct 31 '13 at 05:25

2 Answers2

3

Here are a few hints:

  1. Use a better selection algorithm: QuickSelect is a faster version of QuickSort for selecting the kth element in an array. For compile-time-constant mask sizes, sorting networks are even faster, thanks to high TLP and a O(log^2 n) critical path. If you only have 8-bit values, you can use a histogram-based approach. This paper describes an implementation that takes constant time per pixel, independent of mask size, which makes it very fast for very large mask sizes. You can parallelize it by using a minimal launch strategy (only run as many threads as you need to keep all SMs at max capacity), tiling the image, and letting threads of the same block cooperate on each kernel histogram.
  2. Sort in registers. For small mask sizes, you can keep the entire array in registers, making median selection with a sorting network much faster. For larger mask sizes, you can use shared memory.
  3. Copy all pixels used by the block to shared memory first, and then copy to thread-local buffers that are also in shared memory.
  4. If you only have a few masks that need to go really fast (such as 3x3 and 5x5), use templates to make them compile time constants. This can speed things up a lot because the compiler can unroll loops and re-order a lot more instructions, possibly improving load batching and other goodies, leading to large speed-ups.
  5. Make sure, your reads are coalesced and aligned.

There are many other optimizations you can do. Make sure, you read through the CUDA documents, especially the Programming Guide and the Best Practices Guide. When you really want to gun for high performance, don't forget to take a good look at a CUDA profiler, such as the Visual Profiler.

Community
  • 1
  • 1
Domi
  • 22,151
  • 15
  • 92
  • 122
1

Even in a single thread one can sort the array and pick the value in the middle in O(p*log(p)), which makes O(p^2) look excessive. If you have p threads at your disposal it's also possible to sort the array as fast as O(log(p)), although that may not be the fastest solution for small p. See the top answer here:

Which parallel sorting algorithm has the best average case performance?

Community
  • 1
  • 1
Michael
  • 5,775
  • 2
  • 34
  • 53