5

tl;dr: What is the fastest way to sort an uint8x16_t?

I need to sort many arrays of exactly 16 unsigned bytes (in descending order, which doesn't matter, of course), and i'm trying to optimize sorting by means of ARM NEON vectorization.

And i find it to be quite a fancy puzzle, as it seems that there "must" exist a short combination of NEON instructions (such as vmax/vpmax/vmin/vpmin, vzip/vuzp) that reliably results in a sorted array.

For example, if we transform a pair (A, B) of two 8-byte arrays into (vpmax(A,B), vpmin(A,B)), we obtain same 16 values, just in different order. If we repeat this operation four times, we reliably have the array maximum in the first cell and the array minimum in the last cell; we cannot be sure about the middle elements though.

Another example: if we first do (C,D)=(vmax(A,B),vmin(A,B)), then we do (E,F)=(vpmax(C,D),vpmin(C,D)), then we do (G,H)=vzip(E,F), then we get our array split into four parts of four bytes, in each part we already know the largest element and the smallest element. Probably the next naive step would be to deinterleave this array to have top four bytes at start of the array (which won't necessary be the top 4 elements of the array, just top bytes of their respective groups) and repeat, not yet sure where it leads at the end.

Is there any known method for this particular problem or for other similar problems (for different array sizes or whatever)? Any ideas are appreciated :)

NoQ
  • 75
  • 6
  • 1
    http://stackoverflow.com/questions/3969813/which-parallel-sorting-algorithm-has-the-best-average-case-performance – artless noise Feb 24 '14 at 16:17
  • 2
    The easiest way is to arrange your data vertically rather than horizontally and implement a 16x16 sorting network using min/max operations. If you can't organise your data this way then you can do a fairly quick 16x16 transpose before/after. – Paul R Feb 24 '14 at 16:41
  • @artless noise: thanks, nice set of links i've missed. They're mostly focused on large arrays, but i guess i'd be able to find something useful. – NoQ Feb 24 '14 at 16:50
  • @Paul R: this "sorting network" thing seems to be the keyword i was looking for, will try it out. – NoQ Feb 24 '14 at 16:52
  • Also http://www.vldb.org/pvldb/1/1454171.pdf looks promising. – artless noise Feb 24 '14 at 17:16
  • 1
    ARM's [NEON Programmer's Guide](http://infocenter.arm.com/help/topic/com.arm.doc.den0018a/) has a worked example of a 7x7 median filter which begins by sorting multiple lists of 7 elements concurrently, and follows with a direct-to-memory transpose. You should be able to extrapolate from that, using a sorting network from [here](http://www.cs.brandeis.edu/~hugues/sorting_networks.html). – sh1 Mar 02 '14 at 19:03

0 Answers0