418

While answering another Stack Overflow question (this one) I stumbled upon an interesting sub-problem. What is the fastest way to sort an array of 6 integers?

As the question is very low level:

  • we can't assume libraries are available (and a call itself has its cost), only plain C
  • to avoid emptying the instruction pipeline (which has a very high cost) we should probably minimize branches, jumps, and every other kind of control flow break (like those hidden behind sequence points in && or ||).
  • room is constrained and minimizing registers and memory use is an issue; ideally an in-place sort is probably best.

Really this question is a kind of golf where the goal is not to minimize source length but execution time. I call it 'Zening' code, after the title of the book Zen of Code Optimization by Michael Abrash and its sequels.

As for why it is interesting, there are several layers:

  • the example is simple and easy to understand and measure, not much C skill involved
  • it shows the effect of choosing a good algorithm for the problem, but also the effects of the compiler and underlying hardware.

Here is my reference (naive, not optimized) implementation and my test set.

#include <stdio.h>

static __inline__ void sort6(int * d){

    char j, i, imin;
    int tmp;
    for (j = 0 ; j < 5 ; j++){
        imin = j;
        for (i = j + 1; i < 6 ; i++){
            if (d[i] < d[imin]){
                imin = i;
            }
        }
        tmp = d[j];
        d[j] = d[imin];
        d[imin] = tmp;
    }
}

static __inline__ unsigned long long rdtsc(void)
{
  unsigned long long int x;
     __asm__ volatile (".byte 0x0f, 0x31" : "=A" (x));
     return x;
}

int main(int argc, char ** argv){
    int i;
    int d[6][6] = {
        {1, 2, 3, 4, 5, 6},
        {6, 5, 4, 3, 2, 1},
        {100, 2, 300, 4, 500, 6},
        {100, 2, 3, 4, 500, 6},
        {1, 200, 3, 4, 5, 600},
        {1, 1, 2, 1, 2, 1}
    };

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6 ; i++){
        sort6(d[i]);
        /*
         * printf("d%d : %d %d %d %d %d %d\n", i,
         *  d[i][0], d[i][1], d[i][2],
         *  d[i][3], d[i][4], d[i][5]);
        */
    }
    cycles = rdtsc() - cycles;
    printf("Time is %d\n", (unsigned)cycles);
}
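
Note that, as Paolo Bonzini points out in the comments, the rdtsc above is not quite right for 64-bit: rdtsc returns its result in EDX:EAX, so the value has to be reassembled explicitly. The corrected version (the one used in Kevin Stock's answer below) is:

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    /* rdtsc leaves the timestamp in EDX:EAX; combine both halves explicitly */
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}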

Raw results

As the number of variants has become large, I gathered them all in a test suite that can be found here. The actual tests used are a bit less naive than those shown above, thanks to Kevin Stock. You can compile and execute it in your own environment. I'm quite interested in the behavior on different target architectures/compilers. (OK guys, put it in answers, I will +1 every contributor of a new result set.)

I gave the answer to Daniel Stutzbach (for golfing) one year ago, as he was at the source of the fastest solution at that time (sorting networks).

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O2

  • Direct call to qsort library function : 689.38
  • Naive implementation (insertion sort) : 285.70
  • Insertion Sort (Daniel Stutzbach) : 142.12
  • Insertion Sort Unrolled : 125.47
  • Rank Order : 102.26
  • Rank Order with registers : 58.03
  • Sorting Networks (Daniel Stutzbach) : 111.68
  • Sorting Networks (Paul R) : 66.36
  • Sorting Networks 12 with Fast Swap : 58.86
  • Sorting Networks 12 reordered Swap : 53.74
  • Sorting Networks 12 reordered Simple Swap : 31.54
  • Reordered Sorting Network w/ fast swap : 31.54
  • Reordered Sorting Network w/ fast swap V2 : 33.63
  • Inlined Bubble Sort (Paolo Bonzini) : 48.85
  • Unrolled Insertion Sort (Paolo Bonzini) : 75.30

Linux 64 bits, gcc 4.6.1 64 bits, Intel Core 2 Duo E8400, -O1

  • Direct call to qsort library function : 705.93
  • Naive implementation (insertion sort) : 135.60
  • Insertion Sort (Daniel Stutzbach) : 142.11
  • Insertion Sort Unrolled : 126.75
  • Rank Order : 46.42
  • Rank Order with registers : 43.58
  • Sorting Networks (Daniel Stutzbach) : 115.57
  • Sorting Networks (Paul R) : 64.44
  • Sorting Networks 12 with Fast Swap : 61.98
  • Sorting Networks 12 reordered Swap : 54.67
  • Sorting Networks 12 reordered Simple Swap : 31.54
  • Reordered Sorting Network w/ fast swap : 31.24
  • Reordered Sorting Network w/ fast swap V2 : 33.07
  • Inlined Bubble Sort (Paolo Bonzini) : 45.79
  • Unrolled Insertion Sort (Paolo Bonzini) : 80.15

I included both -O1 and -O2 results because, surprisingly, for several programs -O2 is less efficient than -O1. I wonder which specific optimization has this effect?

Comments on proposed solutions

Insertion Sort (Daniel Stutzbach)

As expected minimizing branches is indeed a good idea.

Sorting Networks (Daniel Stutzbach)

Better than insertion sort. I wondered if the main effect did not come from avoiding the outer loop. I gave an unrolled insertion sort a try to check, and indeed we get roughly the same figures (code is here).
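
The pastebin for that unrolled version is gone. Here is a sketch of what such an unrolled insertion sort looks like, using the plain branching insert step that also appears (commented out) in Paolo Bonzini's answer below; the exact tested code is in the linked test suite:

static inline void sort6_insertion_sort_unrolled(int *d) {
/* insert the saved value t into the already-sorted prefix d[0..x] */
#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
#undef ITER
}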

Sorting Networks (Paul R)

The best so far. The actual code I used to test is here. Don't know yet why it is nearly two times as fast as the other sorting network implementation. Parameter passing? Fast max?

Sorting Networks 12 SWAP with Fast Swap

As suggested by Daniel Stutzbach, I combined his 12-swap sorting network with a branchless fast swap (code is here). It is indeed faster, the best so far with a small margin (roughly 5%), as could be expected from using one less swap.
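
Since the pastebin is gone, the branchless fast swap in question is essentially the min/max trick quoted later in phkahler's answer:

/* -(x < y) is all ones when x < y and zero otherwise, so the XOR mask
   selects the right value without any branch */
#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }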

It is also interesting to notice that on the PPC architecture the branchless swap seems to be much (4 times) less efficient than the simple one using an if.

Calling Library qsort

To give another reference point, I also tried, as suggested, just calling the library qsort (code is here). As expected it is much slower: 10 to 30 times slower... As became obvious with the new test suite, the main problem seems to be the initial load of the library on the first call, and it does not compare so poorly with the other versions after that: it is just between 3 and 20 times slower on my Linux box. On some architectures used for tests by others it even seems to be faster (I'm really surprised by that one, as library qsort uses a more complex API).
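
For reference, the library call amounts to something like the following (a minimal sketch; the exact code, including the overflow-safe comparison mentioned later, is in the test suite):

#include <stdlib.h>

/* comparison callback; (x > y) - (x < y) avoids the possible overflow of x - y */
static int cmp_int(const void *a, const void *b)
{
    int x = *(const int *)a, y = *(const int *)b;
    return (x > y) - (x < y);
}

static inline void sort6_libqsort(int *d)
{
    qsort(d, 6, sizeof(int), cmp_int);
}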

Rank order

Rex Kerr proposed another completely different method: for each item of the array, compute its final position directly. This is efficient because computing the rank order does not need branches. The drawback of this method is that it takes three times the amount of memory of the array (one copy of the array plus variables to store the rank orders). The performance results are very surprising (and interesting). On my reference architecture with a 32-bit OS and an Intel Core 2 Quad Q8300, the cycle count was slightly below 1000 (like sorting networks with a branching swap). But when compiled and executed on my 64-bit box (Intel Core 2 Duo) it performed much better: it became the fastest so far. I finally found out the true reason: my 32-bit box uses gcc 4.4.1 and my 64-bit box gcc 4.4.3, and the latter seems much better at optimizing this particular code (there was very little difference for the other proposals).

update:

As the published figures above show, this effect was further enhanced by later versions of gcc, and Rank Order became consistently twice as fast as any other alternative.
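
For the "Rank Order with registers" line in the tables above, the idea (following the remarks in the comments about letting the values live in registers) is simply to hold the six values in locals instead of copying the array. A sketch, assuming the same comparison pattern as Rex Kerr's answer below; the exact tested code is in the test suite:

static inline void sort6_rank_order_reg(int *d) {
    /* keep the six values in locals so the compiler can leave them in registers */
    const int d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5];
    /* mixed > / >= keeps the ranks of equal values distinct */
    const int o0 = (d0 >  d1) + (d0 >  d2) + (d0 >  d3) + (d0 >  d4) + (d0 > d5);
    const int o1 = (d1 >= d0) + (d1 >  d2) + (d1 >  d3) + (d1 >  d4) + (d1 > d5);
    const int o2 = (d2 >= d0) + (d2 >= d1) + (d2 >  d3) + (d2 >  d4) + (d2 > d5);
    const int o3 = (d3 >= d0) + (d3 >= d1) + (d3 >= d2) + (d3 >  d4) + (d3 > d5);
    const int o4 = (d4 >= d0) + (d4 >= d1) + (d4 >= d2) + (d4 >= d3) + (d4 > d5);
    const int o5 = 15 - (o0 + o1 + o2 + o3 + o4);
    d[o0] = d0; d[o1] = d1; d[o2] = d2; d[o3] = d3; d[o4] = d4; d[o5] = d5;
}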

Sorting Networks 12 with reordered Swap

The amazing efficiency of Rex Kerr's proposal with gcc 4.4.3 made me wonder: how could a program with three times the memory usage be faster than branchless sorting networks? My hypothesis was that it had fewer read-after-write dependencies, allowing for better use of the superscalar instruction scheduler of the x86. That gave me an idea: reorder the swaps to minimize read-after-write dependencies. Put more simply: when you do SWAP(1, 2); SWAP(0, 2); you have to wait for the first swap to finish before performing the second one, because both access a common memory cell. When you do SWAP(1, 2); SWAP(4, 5); the processor can execute both in parallel. I tried it and it works as expected: the sorting network runs about 10% faster.

Sorting Networks 12 with Simple Swap

One year after the original post, Steinar H. Gunderson suggested that we should not try to outsmart the compiler and should keep the swap code simple. It's indeed a good idea, as the resulting code is about 40% faster! He also proposed a swap optimized by hand using x86 inline assembly that can still spare some more cycles. The most surprising part (it speaks volumes about programmer psychology) is that one year ago none of us had tried that version of the swap. The code I used to test is here. Others suggested other ways to write a fast C swap, but it yields the same performance as the simple one with a decent compiler.

The "best" code is now as follow:

static inline void sort6_sorting_network_simple_swap(int * d){
#define min(x, y) (x<y?x:y)
#define max(x, y) (x<y?y:x) 
#define SWAP(x,y) { const int a = min(d[x], d[y]); \
                    const int b = max(d[x], d[y]); \
                    d[x] = a; d[y] = b; }
    SWAP(1, 2);
    SWAP(4, 5);
    SWAP(0, 2);
    SWAP(3, 5);
    SWAP(0, 1);
    SWAP(3, 4);
    SWAP(1, 4);
    SWAP(0, 3);
    SWAP(2, 5);
    SWAP(1, 3);
    SWAP(2, 4);
    SWAP(2, 3);
#undef SWAP
#undef min
#undef max
}

If we believe our test set (and, yes, it is quite poor; its mere benefit is being short, simple and easy to understand what we are measuring), the average number of cycles of the resulting code for one sort is below 40 cycles (6 tests are executed). That puts each swap at an average of 4 cycles. I call that amazingly fast. Any other improvements possible?

VividD
  • 10,456
  • 6
  • 64
  • 111
kriss
  • 23,497
  • 17
  • 97
  • 116
  • It's not really golf unless you can objectively score the answers, so you need to specify a particular architecture (and whether it's going to be scored on average-case or worst-case). – caf May 07 '10 at 07:31
  • ints on a GPU... can you use floats instead? Then you have min/max functions available. (At least GLSL does not support min/max for ints.) Also, it is probably faster to use two `vec3` or similar types instead of an array, so you can use swizzling to sort. – Thomas May 07 '10 at 07:40
  • @Thomas: yes, you are right, it would be more logical using array of float. Really type of vector content is not relevant for the problem beyond that it is some kind of 'built-in' type hold in a register. – kriss May 07 '10 at 08:23
  • @caf: I will use some reference timer that read cycle register, like this one http://www.fit.vutbr.cz/~peringer/SIMLIB/doc/html/rdtsc_8h-source.html . In the GPU context I know it's cheating, but it keep rules easy for golfing purpose. – kriss May 07 '10 at 08:30
  • 2
    Do you have some constraints on the ints ? For example, can we assume that for any 2 x,y `x-y` and `x+y` won't cause underflow or overflow ? – Matthieu M. May 07 '10 at 09:49
  • in the context I have not information on the int (you can imagine they are arbitrary pixels coding colors), so no, sorry you can't spare a bit for the swap trick ;-) – kriss May 07 '10 at 11:16
  • 3
    You should try combining my 12-swap sorting network with Paul's branchless swap function. His solution passes all of the parameters as separate elements on the stack instead of a single pointer to an array. That might also make a difference. – Daniel Stutzbach May 07 '10 at 22:22
  • @kriss : As a non-C guy, I'm quite fascinated by this post. I'm just wondering what `-O1` , `-O2` , etc. are. Are they compiler-optimization levels? – Zaid May 08 '10 at 09:15
  • @Zaid: yes, that's the predefined optimizations sets of gcc (you can also define optimization features individually). You can see here http://gcc.gnu.org/onlinedocs/gcc/Optimize-Options.html what are those optimizations Other compilers have similar optimizations flags. – kriss May 08 '10 at 09:45
  • Great analysis, and good catch on reordering the swaps. From the page I linked to that generates the SWAP macros, it looks like you can get an optimal order with the "View the network using text characters" option instead of the "Create a set of SWAP macros" option. I'm emailing the author to see if can get the SWAP macros to output in the optimum order. – Daniel Stutzbach May 10 '10 at 20:00
  • 2
    Note that the correct implementation of rdtsc on 64-bit is `__asm__ volatile (".byte 0x0f, 0x31; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");` because rdtsc puts the answer in EDX:EAX while GCC expects it in a single 64-bit register. You can see the bug by compiling at -O3. Also see below my comment to Paul R about a faster SWAP. – Paolo Bonzini Aug 24 '11 at 09:24
  • Some advice - 6 loops is not enough for all platforms, also you should use a power of two test case to allow mod by power of two with and to repeat multiple times without polluting the results with potentially expensive mod operations. Here are some results for a platform I don't think I can name owing to NDA, but which uses PPC architecture: – jheriko Aug 24 '11 at 13:09
  • **Direct call to qsort library function** : 101 **Naive implementation (insertion sort)** : 299 **Insertion Sort (Daniel Stutzbach)** : 108 **Insertion Sort Unrolled** : 51 **Sorting Networks (Daniel Stutzbach)** : 26 **Sorting Networks (Paul R)** : 85 **Sorting Networks 12 with Fast Swap** : 117 **Sorting Networks 12 reordered Swap** : 116 **Rank Order** : 56 – jheriko Aug 24 '11 at 13:09
  • 1
    I've got a beef with the Sorting networks 12 with fast swap code: Look at the min & max functions: There's a common element that contains a branch. – Loren Pechtel Aug 24 '11 at 15:04
  • 1
    Maybe the compiler is smart enough to do this for you, but a small speedup in the min/max SWAP appears to be available via reusing the xor'd value from min/max, like this: `int s = (x^y) & -(x – Tyler Aug 24 '11 at 15:10
  • @Loren, there's no branch from computing `x – Tyler Aug 24 '11 at 15:13
  • 3
    @Tyler: How do you implement it at the assembly level without a branch? – Loren Pechtel Aug 24 '11 at 16:32
  • @Loren, it's just a cmp instruction which sets a status bit. – Kevin Stock Aug 24 '11 at 20:10
  • You're doing [Selection Sort](http://en.wikipedia.org/wiki/Selection_sort), not Insertion Sort. Otherwise, great analysis, with convincing proof. The swap-order optimization is most interesting. – cdunn2001 Aug 24 '11 at 21:14
  • @Loren, I think you're saying "the CPU has to have different behavior based on if x – Tyler Aug 24 '11 at 21:27
  • What is the point in minimising the execution speed? – bandi Aug 24 '11 at 21:42
  • @Tyler: I guess my assembly is too rusty as I don't recall any way to get that status bit into a register without jumping through a lot of hoops (push the flags, pop AX, mask and then shift. 4 cycles minimum {it's been a *LONG* time since I looked up cycles used, I think these are all one-cycle operations these days} and one memory stall.) The only bit I recall being able to use directly is doing an add with carry. – Loren Pechtel Aug 25 '11 at 04:19
  • 4
    @Loren: `CMP EAX, EBX; SBB EAX, EAX` will put either 0 or 0xFFFFFFFF in `EAX` depending on whether `EAX` is larger or smaller than `EBX`, respectively. `SBB` is "subtract with borrow", the counterpart of `ADC` ("add with carry"); the status bit you refer to *is* the carry bit. Then again, I remember that `ADC` and `SBB` had terrible latency & throughput on the Pentium 4 vs. `ADD` and `SUB`, and were still twice as slow on Core CPUs. Since the 80386 there are also `SETcc` conditional-store and `CMOVcc` conditional-move instructions, but they're also slow. – j_random_hacker Aug 25 '11 at 05:59
  • @bandi: as pointless as minimizing source length, as I stated it's a game, a kind of golf. But I believe the process has it's points. It learns us things on what type of code is efficient or not at several levels : algorithmic (usually the greater benefits), but also compiler and hardware levels. It also give some reference point as to the kind of performance that can be expected from a C program (if necessary). – kriss Aug 25 '11 at 06:25
  • @Loren: Pentium 2 and newer have fast conditional moves, see Steinar Gunderson's assembly code. – Paolo Bonzini Aug 26 '11 at 08:37
  • @kriss I think what you said is true for the case of *maximising* the speed, which is completely understandable. What I don't understand is that why would you minimise it? Why you want to write a slow program? – bandi Aug 26 '11 at 11:09
  • @bandi: bad word, my fault, I want to minimize execution *time* not speed ;-). Thanks to have spotted the typo. – kriss Aug 26 '11 at 11:51
  • "Zening", I like it. Not sure where to go with this word, but I will try to use it. – Kent Mar 04 '14 at 03:31
  • The links to copypastecode.com are broken; now it seems to be a dodgy advertising site. – Olaf Seibert Aug 25 '15 at 10:10
  • @Olaf: indeed copypastecode.com went down two years ago (!) As soon as I have time available I will put back the content on some working website (probably github). Thanks for the warning. – kriss Aug 26 '15 at 02:11
  • This question actually has little to do with C, and more to do with assembly, as you are inspecting the machine code that is produced. Additionally, you've used `__asm__` which really isn't *plain C*. – autistic Dec 08 '15 at 03:08
  • @seb 11: some posters used asm for answers, I used it to access hardware timers on x86 for accurate measures but this is marginal. There is no more assembly here than in most typical libc implementations. Would you argue using libc makes it not plain C ? Most answers are pure C. But I disagree with you, knowing how the language is compiled under the hood (and what could be efficiently compiled on one target or another) is a very important part of C knowledge. Of course nobody is forced to read that question or its answers. Don't care if you don't master that level of C. – kriss Dec 08 '15 at 14:40
  • 1
    For C++, I've recently written a [templated class to generate Bose-Nelson networks on compile time](https://github.com/webby1111/Static-Sort). With optimizations on, the performance is on par with the fastest handcoded answers here. – Vectorized Mar 24 '16 at 20:46
  • Hm... you specify that the language must be C, yet you say the code will be run on a GPU. How do you run C code on a GPU? – Stefan Monov Mar 12 '17 at 16:52
  • @kriss: Also, I see that you removed the "will be run on a GPU" statement. Yet the `gpgpu` tag remains. Do you no longer intend the question to be about GPU code? – Stefan Monov Mar 12 '17 at 16:59
  • @Stefan Monov: still interrested, but I believe that the code would be quite different on a GPU. Henceforth it probably should be another question. As I'm lazy I din't opened it. So if you have a good answer working on GPU feel free to give it. – kriss Mar 12 '17 at 22:43
  • To run C code on a GPU you can use CUDA or OpenCL. It raises some restrictions, but it's still C code and benefits from GPU. By the way, if you have a GPU, just sorting 6 numbers would probably be waste of power. – kriss Mar 12 '17 at 22:46
  • The pastebin contents are currently pretty messed up; e.g. it has all on one line `static inline void sort6_sorting_network_v1(int * d){ #include ` . – Don Hatch Nov 23 '21 at 15:40
  • @Don Hatch: not happening for me. Maybe you are using some broken text editor messing things with end of lines ? All I can say is it's working fine for me. – kriss Nov 23 '21 at 15:49
  • @kriss very odd. It's not my editor, I can see it damaged in my chrome browser, and even if I just download the pastebin using curl: `curl https://pastebin.com/azzuk072 | grep include` contains the damaged line: `static inline void sort6_sorting_network_v1(int * d){ #include <stdio.h>` – Don Hatch Nov 23 '21 at 16:04
  • Apparently when I view or download it, there's a second identical copy of the file inserted into the middle of the first copy of the file. If I just edit the file and remove that second copy (or the first copy), it works. – Don Hatch Nov 23 '21 at 16:09
  • @Don Hatch: my bad, I see also see the mixup now. Don't know how it happened. This hasn't changed for years – kriss Nov 23 '21 at 16:09

25 Answers

166

For any optimization, it's always best to test, test, test. I would try at least sorting networks and insertion sort. If I were betting, I'd put my money on insertion sort based on past experience.

Do you know anything about the input data? Some algorithms will perform better with certain kinds of data. For example, insertion sort performs better on sorted or almost-sorted data, so it will be the better choice if there's an above-average chance of almost-sorted data.

The algorithm you posted is similar to an insertion sort, but it looks like you've minimized the number of swaps at the cost of more comparisons. Comparisons are far more expensive than swaps, though, because branches can cause the instruction pipeline to stall.

Here's an insertion sort implementation:

static __inline__ void sort6(int *d){
        int i, j;
        for (i = 1; i < 6; i++) {
                int tmp = d[i];
                for (j = i; j >= 1 && tmp < d[j-1]; j--)
                        d[j] = d[j-1];
                d[j] = tmp;
        }
}

Here's how I'd build a sorting network. First, use this site to generate a minimal set of SWAP macros for a network of the appropriate length. Wrapping that up in a function gives me:

static __inline__ void sort6(int * d){
#define SWAP(x,y) if (d[y] < d[x]) { int tmp = d[x]; d[x] = d[y]; d[y] = tmp; }
    SWAP(1, 2);
    SWAP(0, 2);
    SWAP(0, 1);
    SWAP(4, 5);
    SWAP(3, 5);
    SWAP(3, 4);
    SWAP(0, 3);
    SWAP(1, 4);
    SWAP(2, 5);
    SWAP(2, 4);
    SWAP(1, 3);
    SWAP(2, 3);
#undef SWAP
}
Martijn
  • 6,713
  • 3
  • 31
  • 38
Daniel Stutzbach
  • 74,198
  • 17
  • 88
  • 77
  • 9
    +1: nice, you did it with 12 exchanges rather than the 13 in my hand-coded and empirically derived network above. I'd give you another +1 if I could for the link to the site that generates networks for you - now bookmarked. – Paul R May 07 '10 at 20:52
  • 9
    This is a fantastic idea for a general purpose sorting function if you expect the majority of requests to be small sized arrays. Use a switch statement for the cases that you want to optimize, using this procedure; let the default case use a library sort function. – Mark Ransom May 07 '10 at 21:47
  • 6
    @Mark A *good* library sort function will already have a fast-path for small arrays. Many modern libraries will use a recursive QuickSort or MergeSort that switches to InsertionSort after recursing down to `n < SMALL_CONSTANT`. – Daniel Stutzbach May 07 '10 at 22:16
  • @Daniel, good point; I should have thought of that. Doesn't that imply the correct answer to the question then is to just use the library sort? – Mark Ransom May 07 '10 at 22:25
  • 1
    @Mark: I believe the cost you have to pay just to call the library function (instead of static inline) is so high it defeats the library optimizations. But you are right, I should provide figures for plain library call to give a reference point. – kriss May 07 '10 at 22:47
  • 3
    @Mark Well, a C library sort function requires that you specify the comparison operation via a function pointer. The overhead of calling a function for every comparison is huge. Usually, that's still the cleanest way to go, because this is rarely a critical path in the program. However, if it is the critical path, we really can sort much faster if we know we're sorting integers and exactly 6 of them. :) – Daniel Stutzbach May 07 '10 at 23:08
  • Would moving the int tmp out of the macro scope into function scope be of benefit? – EvilTeach Aug 24 '11 at 14:30
  • You can get rid of `tmp` entirely by using an XOR swap: `d[x]^=d[y]; d[y]^=d[x]; d[x]^=d[y];` – tghw Aug 24 '11 at 16:15
  • 8
    @tgwh: XOR swap is almost always a bad idea. – Paul R Jun 25 '13 at 15:20
  • 1
    @JohnnyW: There are a number of reasons - see e.g. [this question and answer](http://programmers.stackexchange.com/questions/182037/is-this-xor-value-swap-algorithm-still-in-use-or-useful) as a starting point. – Paul R Jul 07 '13 at 20:28
  • 1
    I found that replacing the `if` statement in the `SWAP` macro with a branch-free two-element sort (using min / max) doubles the speed of this code. – Alex Reinking Jul 13 '17 at 02:16
66

Here's an implementation using sorting networks:

inline void Sort2(int *p0, int *p1)
{
    const int temp = min(*p0, *p1);
    *p1 = max(*p0, *p1);
    *p0 = temp;
}

inline void Sort3(int *p0, int *p1, int *p2)
{
    Sort2(p0, p1);
    Sort2(p1, p2);
    Sort2(p0, p1);
}

inline void Sort4(int *p0, int *p1, int *p2, int *p3)
{
    Sort2(p0, p1);
    Sort2(p2, p3);
    Sort2(p0, p2);  
    Sort2(p1, p3);  
    Sort2(p1, p2);  
}

inline void Sort6(int *p0, int *p1, int *p2, int *p3, int *p4, int *p5)
{
    Sort3(p0, p1, p2);
    Sort3(p3, p4, p5);
    Sort2(p0, p3);  
    Sort2(p2, p5);  
    Sort4(p1, p2, p3, p4);  
}

You really need very efficient branchless min and max implementations for this, since that is effectively what this code boils down to - a sequence of min and max operations (13 of each, in total). I leave this as an exercise for the reader.
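
One option (see Steinar H. Gunderson's comments further down the thread: GCC compiles the plain ternary into conditional moves rather than a branch) is simply:

/* straightforward min/max that good compilers turn into cmov */
static inline int min(int x, int y) { return x < y ? x : y; }
static inline int max(int x, int y) { return x < y ? y : x; }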

Note that this implementation lends itself easily to vectorization (e.g. SIMD - most SIMD ISAs have vector min/max instructions) and also to GPU implementations (e.g. CUDA - being branchless there are no problems with warp divergence etc).

See also: Fast algorithm implementation to sort very small list

Community
  • 1
  • 1
Paul R
  • 208,748
  • 37
  • 389
  • 560
  • 1
    For some bit hacks for min/max: http://graphics.stanford.edu/~seander/bithacks.html#IntegerMinOrMax – Rubys May 07 '10 at 08:20
  • 1
    @Paul: in the real CUDA use context, it's certainly the best answer. I will check if it also is (and how much) in golf x64 context and publish result. – kriss May 07 '10 at 12:06
  • 1
    `Sort3` would be faster (on most architectures, anyway) if you noted that `(a+b+c)-(min+max)` is the central number. – Rex Kerr May 07 '10 at 23:35
  • @Rex: interesting idea - it would require a widening of the data though, to prevent overflow, which would mean a performance impact in some cases (especially SIMD). It would be interesting to count the operations though: the above implementation of Sort3 is 3 `max` and 3 `min` operations for a total of 6 - how many operations do you think your method would be ? – Paul R May 08 '10 at 07:16
  • @Paul: Overflow doesn't matter--you underflow back into range again (unless this is some weird architecture that doesn't do integer math mod 2^32). My method is 1 min, 1 max, 2 add, 2 sub--and add/sub are usually faster than min/max. If they're the same, it should be equivalent. – Rex Kerr May 08 '10 at 12:55
  • 1
    @Rex: I see - that looks good. For SIMD architectures like AltiVec and SSE it would be the same number of instruction cycles (max and min are single cycle instructions like add/subtract), but for a normal scalar CPU your method looks better. – Paul R May 08 '10 at 16:13
  • 2
    If I let GCC optimize min with conditional move instructions I get a 33% speedup: `#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }`. Here I'm not using ?: for d[y] because it gives slightly worse performance, but it's almost in the noise. – Paolo Bonzini Aug 24 '11 at 08:55
48

Since these are integers and compares are fast, why not compute the rank order of each directly:

#include <string.h>   /* for memcpy */

inline void sort6(int *d) {
  int e[6];
  memcpy(e,d,6*sizeof(int));
  int o0 = (d[0]>d[1])+(d[0]>d[2])+(d[0]>d[3])+(d[0]>d[4])+(d[0]>d[5]);
  int o1 = (d[1]>=d[0])+(d[1]>d[2])+(d[1]>d[3])+(d[1]>d[4])+(d[1]>d[5]);
  int o2 = (d[2]>=d[0])+(d[2]>=d[1])+(d[2]>d[3])+(d[2]>d[4])+(d[2]>d[5]);
  int o3 = (d[3]>=d[0])+(d[3]>=d[1])+(d[3]>=d[2])+(d[3]>d[4])+(d[3]>d[5]);
  int o4 = (d[4]>=d[0])+(d[4]>=d[1])+(d[4]>=d[2])+(d[4]>=d[3])+(d[4]>d[5]);
  int o5 = 15-(o0+o1+o2+o3+o4);
  d[o0]=e[0]; d[o1]=e[1]; d[o2]=e[2]; d[o3]=e[3]; d[o4]=e[4]; d[o5]=e[5];
}
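
As a quick check of the duplicate handling discussed in the comments below, the duplicate-heavy vector from the question's test set gets distinct ranks (0, 1, 4, 2, 5, 3), so no slot is written twice:

#include <stdio.h>
/* sort6() as defined above */

int main(void) {
    int d[6] = {1, 1, 2, 1, 2, 1};   /* duplicate-heavy case from the question */
    sort6(d);
    for (int i = 0; i < 6; i++) printf("%d ", d[i]);   /* prints: 1 1 1 1 2 2 */
    return 0;
}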
Rex Kerr
  • 166,841
  • 26
  • 322
  • 407
  • @Rex: with gcc -O1 it's below 1000 cycles, quite fast but slower than sorting network. Any idea to improve code ? Maybe if we could avoid array copy... – kriss May 07 '10 at 23:47
  • @kriss: It's faster than the sorting network for me with -O2. Is there some reason why -O2 isn't okay, or is it slower for you on -O2 also? Maybe it's a difference in machine architecture? – Rex Kerr May 08 '10 at 01:22
  • @Rex: O2 is indeed also the best option for me with your program. Cycle count is around 950 (I launch program several times as result is never perfectly stable). Thus it is faster than the first Network Sort implementation (the one without branchless swap) but slower than the other two. But you are right, target architecture or exact processor model can make a difference. 400 cycles is not a big difference. My testing target is an Intel Core2 Quad Q8300@2.50GHz, stepping 0a (though with testing method frequency should be irrelevant). – kriss May 08 '10 at 06:22
  • @Rex: I also wonder if your method is really working on every dataset. I wonder if you do not have cases where several values are mapped to the same place when sorted data are repeated. – kriss May 08 '10 at 06:34
  • 1
    @Rex: sorry, I missed the > vs >= pattern at first sight. It works in every case. – kriss May 08 '10 at 06:41
  • @Rex: I tried your code on my other test machine (Intel Core 2 E8400 @ 3GHz with native Linux 64bits OS) and on it your program is the fastest (~370 cycles vs ~390). I should edit my question to provide results for both architectures (with your answer). – kriss May 08 '10 at 07:00
  • @kriss: I think a factor of 2 difference in cycles is quite large, especially since I was testing on a 2-core machine of the same vintage as the Q8300! – Rex Kerr May 08 '10 at 18:32
  • @Rex; I updated my answer. The true reason was version of compiler (gcc441 vs gcc 443) not target architecture. I didn't identified exactly what optimization. Your solution seems to hard push gcc. For example gcc443 yield much better results with O1 than with O2). I guess I will have to look at assembly code if I really want to understand why. – kriss May 09 '10 at 21:15
  • 3
    @kriss: Aha. That is not completely surprising--there are a lot of variables floating around, and they have to be carefully ordered and cached in registers and so on. – Rex Kerr May 09 '10 at 22:32
  • The efficiency of this approach depends a lot on how easy it is to do (int)(a>=b) on the particular machine. Pentiums gave the 'setcc' instruction allowing this to be done in two instructions but on some processors it could require more complex code. Although on almost anything modern it should be possible without a branch, compilers may not always oblige. If you can assume no overflow in the subtract, ((a-b)>>31) gives 0 or -1, and that could be used instead. – greggo Sep 18 '14 at 13:43
  • A big 'plus' here is that there are no 'stores' in the main part, so the optimizer will be able to easily work in registers, and only read each d[i] once. If the compiler can determine that, e.g. e[0] has already been read as d[0] (and you have enough registers), it would be even better. That could be made explicit by using 6 local variables d0..d5 instead of the memcpy to e. – greggo Sep 18 '14 at 13:53
  • 2
    @SSpoke `0+1+2+3+4+5=15` Since one of them is missing, 15 minus the sum of the rest yields missing one – Glenn Teitelbaum Jun 01 '16 at 20:51
  • Since `d[o0]=e[0];` same as `d[o0]=d[0];`, could use `memcpy(e+1,d+1,5*sizeof(int));`. – chux - Reinstate Monica Jul 09 '22 at 14:44
35

Looks like I got to the party a year late, but here we go...

Looking at the assembly generated by gcc 4.5.2 I observed that loads and stores are being done for every swap, which really isn't needed. It would be better to load the 6 values into registers, sort those, and store them back into memory. I ordered the loads and stores to be as close as possible to where the registers are first needed and last used. I also used Steinar H. Gunderson's SWAP macro. Update: I switched to Paolo Bonzini's SWAP macro which gcc converts into something similar to Gunderson's, but gcc is able to better order the instructions since they aren't given as explicit assembly.

I used the same swap order as the reordered swap network given as the best performing, although there may be a better ordering. If I find some more time I'll generate and test a bunch of permutations.

I changed the testing code to consider over 4000 arrays and show the average number of cycles needed to sort each one. On an i5-650 I'm getting ~34.1 cycles/sort (using -O3), compared to the original reordered sorting network getting ~65.3 cycles/sort (using -O1, beats -O2 and -O3).

#include <stdio.h>

static inline void sort6_fast(int * d) {
#define SWAP(x,y) { int dx = x, dy = y, tmp; tmp = x = dx < dy ? dx : dy; y ^= dx ^ tmp; }
    register int x0,x1,x2,x3,x4,x5;
    x1 = d[1];
    x2 = d[2];
    SWAP(x1, x2);
    x4 = d[4];
    x5 = d[5];
    SWAP(x4, x5);
    x0 = d[0];
    SWAP(x0, x2);
    x3 = d[3];
    SWAP(x3, x5);
    SWAP(x0, x1);
    SWAP(x3, x4);
    SWAP(x1, x4);
    SWAP(x0, x3);
    d[0] = x0;
    SWAP(x2, x5);
    d[5] = x5;
    SWAP(x1, x3);
    d[1] = x1;
    SWAP(x2, x4);
    d[4] = x4;
    SWAP(x2, x3);
    d[2] = x2;
    d[3] = x3;

#undef SWAP
#undef min
#undef max
}

static __inline__ unsigned long long rdtsc(void)
{
    unsigned long long int x;
    __asm__ volatile ("rdtsc; shlq $32, %%rdx; orq %%rdx, %0" : "=a" (x) : : "rdx");
    return x;
}

void ran_fill(int n, int *a) {
    static int seed = 76521;
    while (n--) *a++ = (seed = seed *1812433253 + 12345);
}

#define NTESTS 4096
int main() {
    int i;
    int d[6*NTESTS];
    ran_fill(6*NTESTS, d);

    unsigned long long cycles = rdtsc();
    for (i = 0; i < 6*NTESTS ; i+=6) {
        sort6_fast(d+i);
    }
    cycles = rdtsc() - cycles;
    printf("Time is %.2lf\n", (double)cycles/(double)NTESTS);

    for (i = 0; i < 6*NTESTS ; i+=6) {
        if (d[i+0] > d[i+1] || d[i+1] > d[i+2] || d[i+2] > d[i+3] || d[i+3] > d[i+4] || d[i+4] > d[i+5])
            printf("d%d : %d %d %d %d %d %d\n", i,
                    d[i+0], d[i+1], d[i+2],
                    d[i+3], d[i+4], d[i+5]);
    }
    return 0;
}

I modified the test suite to also report clocks per sort and run more tests (the cmp function was updated to handle integer overflow as well); here are the results on some different architectures. I attempted testing on an AMD CPU but rdtsc isn't reliable on the X6 1100T I have available.

Clarkdale (i5-650)
==================
Direct call to qsort library function      635.14   575.65   581.61   577.76   521.12
Naive implementation (insertion sort)      538.30   135.36   134.89   240.62   101.23
Insertion Sort (Daniel Stutzbach)          424.48   159.85   160.76   152.01   151.92
Insertion Sort Unrolled                    339.16   125.16   125.81   129.93   123.16
Rank Order                                 184.34   106.58   54.74    93.24    94.09
Rank Order with registers                  127.45   104.65   53.79    98.05    97.95
Sorting Networks (Daniel Stutzbach)        269.77   130.56   128.15   126.70   127.30
Sorting Networks (Paul R)                  551.64   103.20   64.57    73.68    73.51
Sorting Networks 12 with Fast Swap         321.74   61.61    63.90    67.92    67.76
Sorting Networks 12 reordered Swap         318.75   60.69    65.90    70.25    70.06
Reordered Sorting Network w/ fast swap     145.91   34.17    32.66    32.22    32.18

Kentsfield (Core 2 Quad)
========================
Direct call to qsort library function      870.01   736.39   723.39   725.48   721.85
Naive implementation (insertion sort)      503.67   174.09   182.13   284.41   191.10
Insertion Sort (Daniel Stutzbach)          345.32   152.84   157.67   151.23   150.96
Insertion Sort Unrolled                    316.20   133.03   129.86   118.96   105.06
Rank Order                                 164.37   138.32   46.29    99.87    99.81
Rank Order with registers                  115.44   116.02   44.04    116.04   116.03
Sorting Networks (Daniel Stutzbach)        230.35   114.31   119.15   110.51   111.45
Sorting Networks (Paul R)                  498.94   77.24    63.98    62.17    65.67
Sorting Networks 12 with Fast Swap         315.98   59.41    58.36    60.29    55.15
Sorting Networks 12 reordered Swap         307.67   55.78    51.48    51.67    50.74
Reordered Sorting Network w/ fast swap     149.68   31.46    30.91    31.54    31.58

Sandy Bridge (i7-2600k)
=======================
Direct call to qsort library function      559.97   451.88   464.84   491.35   458.11
Naive implementation (insertion sort)      341.15   160.26   160.45   154.40   106.54
Insertion Sort (Daniel Stutzbach)          284.17   136.74   132.69   123.85   121.77
Insertion Sort Unrolled                    239.40   110.49   114.81   110.79   117.30
Rank Order                                 114.24   76.42    45.31    36.96    36.73
Rank Order with registers                  105.09   32.31    48.54    32.51    33.29
Sorting Networks (Daniel Stutzbach)        210.56   115.68   116.69   107.05   124.08
Sorting Networks (Paul R)                  364.03   66.02    61.64    45.70    44.19
Sorting Networks 12 with Fast Swap         246.97   41.36    59.03    41.66    38.98
Sorting Networks 12 reordered Swap         235.39   38.84    47.36    38.61    37.29
Reordered Sorting Network w/ fast swap     115.58   27.23    27.75    27.25    26.54

Nehalem (Xeon E5640)
====================
Direct call to qsort library function      911.62   890.88   681.80   876.03   872.89
Naive implementation (insertion sort)      457.69   236.87   127.68   388.74   175.28
Insertion Sort (Daniel Stutzbach)          317.89   279.74   147.78   247.97   245.09
Insertion Sort Unrolled                    259.63   220.60   116.55   221.66   212.93
Rank Order                                 140.62   197.04   52.10    163.66   153.63
Rank Order with registers                  84.83    96.78    50.93    109.96   54.73
Sorting Networks (Daniel Stutzbach)        214.59   220.94   118.68   120.60   116.09
Sorting Networks (Paul R)                  459.17   163.76   56.40    61.83    58.69
Sorting Networks 12 with Fast Swap         284.58   95.01    50.66    53.19    55.47
Sorting Networks 12 reordered Swap         281.20   96.72    44.15    56.38    54.57
Reordered Sorting Network w/ fast swap     128.34   50.87    26.87    27.91    28.02
Kevin Stock
  • 506
  • 4
  • 9
  • Your idea of register variables should be applied to Rex Kerr's "Rank Order" solution. That should be fastest, and perhaps then the `-O3` optimization will not be counter-productive. – cdunn2001 Aug 24 '11 at 21:30
  • 1
    @cdunn2001 I just tested it, I'm not seeing improvement (except a few cycles at -O0 and -Os). Looking at the asm it appears gcc already managed to figure out to use registers and eliminate the call to memcpy. – Kevin Stock Aug 24 '11 at 22:06
  • Would you mind to add the simple swap version to you test suite, I guess it could be interesting to compare it with assembly fast swap optimized by hand. – kriss Aug 25 '11 at 07:43
  • 1
    Your code still uses Gunderson's swap, mine would be `#define SWAP(x,y) { int oldx = x; x = x < y ? x : y; y ^= oldx ^ x; }`. – Paolo Bonzini Aug 25 '11 at 07:49
  • @Paolo Bonzini: Yes, I intend to add a test case with yours, just not had time yet. But I will avoid inline assembly. – kriss Aug 25 '11 at 10:04
  • @Paolo Bonzini: also it seems that using two temporary variables instead of one in swap has a positive effect, but i'm not really sure of this one. – kriss Aug 25 '11 at 10:07
  • @Paolo Bonzini: I just forgot to update the code on this page (the pastebin link has the correct code). The results are using your macro. Fixed. – Kevin Stock Aug 25 '11 at 10:58
  • @BlackBear: I thought nothing was faster than light. Oh, I see the light. It's registers :-) – kriss Aug 25 '11 at 23:38
  • It seems to me that the ordering of the loads is likely to cause stalls. Since all the loads are happening directly before the first use, the swap has to wait for the variables to be loaded, causing the stall. The interleaved stores are ok, you should hopefully get come overlap between the data being stored and the next swap occurring. Anyway, this is probably all a moot point for a sufficiently good compiler, it'll reorder things to that work occurs while it's waiting for data to be fetched from RAM. It's quite possible that doing all the loads in one block would be quite beneficial on a GPU. – Axman6 Aug 29 '11 at 09:56
  • @Axman6 You're right. In one of my earlier variants I found that interleaving the loads was slightly beneficial, and so it stuck around. However, the compiler now moves most or all of the loads to the beginning. – Kevin Stock Aug 29 '11 at 13:31
  • There are exactly 720 different orderings of 6 elements. The 'best' sort in one sense would be one that had the best worst-case performance among those 720 orderings. – Yakk - Adam Nevraumont Sep 04 '14 at 13:46
  • @cdunn2001 the "Rank Order" solution will tend to be mostly in regs anyway since it defers the stores until all the main calculation is done. Could be improved by using reads into 6 locals instead of the memcpy to 'e' though. I wouldn't be surprised if gcc can do that itself (it *can* treat a small local array as separate int vars; here it would also need to transform the memcpy into 6 separate ops e[0]=d[0]; e[1] = d[1] .. in order to make this optimization. But it would be better to make that explicit in the code). – greggo Sep 18 '14 at 14:02
15

I stumbled onto this question from Google a few days ago because I also had a need to quickly sort a fixed length array of 6 integers. In my case however, my integers are only 8 bits (instead of 32) and I do not have a strict requirement of only using C. I thought I would share my findings anyways, in case they might be helpful to someone...

I implemented a variant of a network sort in assembly that uses SSE to vectorize the compare and swap operations, to the extent possible. It takes six "passes" to completely sort the array. I used a novel mechanism to directly convert the results of PCMPGTB (vectorized compare) to shuffle parameters for PSHUFB (vectorized swap), using only a PADDB (vectorized add) and in some cases also a PAND (bitwise AND) instruction.

This approach also had the side effect of yielding a truly branchless function. There are no jump instructions whatsoever.

It appears that this implementation is about 38% faster than the implementation which is currently marked as the fastest option in the question ("Sorting Networks 12 with Simple Swap"). I modified that implementation to use char array elements during my testing, to make the comparison fair.

I should note that this approach can be applied to any array size up to 16 elements. I expect the relative speed advantage over the alternatives to grow larger for the bigger arrays.

The code is written in MASM for x86_64 processors with SSSE3. The function uses the "new" Windows x64 calling convention. Here it is...

PUBLIC simd_sort_6

.DATA

ALIGN 16

pass1_shuffle   OWORD   0F0E0D0C0B0A09080706040503010200h
pass1_add       OWORD   0F0E0D0C0B0A09080706050503020200h
pass2_shuffle   OWORD   0F0E0D0C0B0A09080706030405000102h
pass2_and       OWORD   00000000000000000000FE00FEFE00FEh
pass2_add       OWORD   0F0E0D0C0B0A09080706050405020102h
pass3_shuffle   OWORD   0F0E0D0C0B0A09080706020304050001h
pass3_and       OWORD   00000000000000000000FDFFFFFDFFFFh
pass3_add       OWORD   0F0E0D0C0B0A09080706050404050101h
pass4_shuffle   OWORD   0F0E0D0C0B0A09080706050100020403h
pass4_and       OWORD   0000000000000000000000FDFD00FDFDh
pass4_add       OWORD   0F0E0D0C0B0A09080706050403020403h
pass5_shuffle   OWORD   0F0E0D0C0B0A09080706050201040300h
pass5_and       OWORD   0000000000000000000000FEFEFEFE00h
pass5_add       OWORD   0F0E0D0C0B0A09080706050403040300h
pass6_shuffle   OWORD   0F0E0D0C0B0A09080706050402030100h
pass6_add       OWORD   0F0E0D0C0B0A09080706050403030100h

.CODE

simd_sort_6 PROC FRAME

    .endprolog

    ; pxor xmm4, xmm4
    ; pinsrd xmm4, dword ptr [rcx], 0
    ; pinsrb xmm4, byte ptr [rcx + 4], 4
    ; pinsrb xmm4, byte ptr [rcx + 5], 5
    ; The benchmarked 38% faster mentioned in the text was with the above slower sequence that tied up the shuffle port longer.  Same on extract
    ; avoiding pins/extrb also means we don't need SSE 4.1, but SSSE3 CPUs without SSE4.1 (e.g. Conroe/Merom) have slow pshufb.
    movd    xmm4, dword ptr [rcx]
    pinsrw  xmm4,  word ptr [rcx + 4], 2  ; word 2 = bytes 4 and 5


    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass1_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass1_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass2_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass2_and]
    paddb xmm5, oword ptr [pass2_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass3_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass3_and]
    paddb xmm5, oword ptr [pass3_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass4_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass4_and]
    paddb xmm5, oword ptr [pass4_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass5_shuffle]
    pcmpgtb xmm5, xmm4
    pand xmm5, oword ptr [pass5_and]
    paddb xmm5, oword ptr [pass5_add]
    pshufb xmm4, xmm5

    movdqa xmm5, xmm4
    pshufb xmm5, oword ptr [pass6_shuffle]
    pcmpgtb xmm5, xmm4
    paddb xmm5, oword ptr [pass6_add]
    pshufb xmm4, xmm5

    ;pextrd dword ptr [rcx], xmm4, 0    ; benchmarked with this
    ;pextrb byte ptr [rcx + 4], xmm4, 4 ; slower version
    ;pextrb byte ptr [rcx + 5], xmm4, 5
    movd   dword ptr [rcx], xmm4
    pextrw  word ptr [rcx + 4], xmm4, 2  ; x86 is little-endian, so this is the right order

    ret

simd_sort_6 ENDP

END

You can compile this to an executable object and link it into your C project. For instructions on how to do this in Visual Studio, you can read this article. You can use the following C prototype to call the function from your C code:

void simd_sort_6(char *values);
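
A minimal usage sketch (hypothetical values; it assumes the assembled object file is linked into the project as described in the article above):

#include <stdio.h>

void simd_sort_6(char *values);   /* implemented in the MASM module above */

int main(void)
{
    char v[6] = {42, 7, 13, 0, 99, 7};
    simd_sort_6(v);
    for (int i = 0; i < 6; i++)
        printf("%d ", v[i]);      /* the six bytes, now in sorted order */
    return 0;
}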
Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Joe Crivello
  • 52
  • 1
  • 10
  • It would be interresting to compare yours with other assembly level proposals. The compared performances of implementation does not include them. Using SSE sounds good anyway. – kriss Dec 03 '12 at 12:33
  • Another area of future research would be the application of the new Intel AVX instructions to this problem. The larger 256-bit vectors are large enough to fit 8 DWORDs. – Joe Crivello Dec 03 '12 at 17:08
  • 1
    Instead of `pxor / pinsrd xmm4, mem, 0`, just use `movd`! – Peter Cordes Sep 01 '15 at 19:00
15

The test code is pretty bad; it overflows the initial array (don't people here read compiler warnings?), the printf is printing out the wrong elements, it uses .byte for rdtsc for no good reason, there's only one run (!), there's nothing checking that the end results are actually correct (so it's very easy to “optimize” into something subtly wrong), the included tests are very rudimentary (no negative numbers?) and there's nothing to stop the compiler from just discarding the entire function as dead code.

That being said, it's also pretty easy to improve on the bitonic network solution; simply change the min/max/SWAP stuff to

#define SWAP(x,y) { int tmp; asm("mov %0, %2 ; cmp %1, %0 ; cmovg %1, %0 ; cmovg %2, %1" : "=r" (d[x]), "=r" (d[y]), "=r" (tmp) : "0" (d[x]), "1" (d[y]) : "cc"); }

and it comes out about 65% faster for me (Debian gcc 4.4.5 with -O2, amd64, Core i7).

  • OK, test code is poor. Feel free to improve it. And yes, you can use assembly code. Why not going all the way and fully code it using x86 assembler ? It may be a bit less portable but why bother ? – kriss Aug 24 '11 at 16:30
  • Thanks for noticing the array overflow, I corrected it. Other people may not have noticed it because the clicked on the link to copy/paste code, where there is no overflow. – kriss Aug 24 '11 at 16:40
  • 4
    You dont' even need assembler, actually; if you just drop all the clever tricks, GCC will recognize the sequence and insert the conditional moves for you: #define min(a, b) ((a < b) ? a : b) #define max(a, b) ((a < b) ? b : a) #define SWAP(x,y) { int a = min(d[x], d[y]); int b = max(d[x], d[y]); d[x] = a; d[y] = b; } It comes out maybe a few percent slower than the inline asm variant, but that's hard to say given the lack of proper benchmarking. – Steinar H. Gunderson Aug 24 '11 at 19:35
  • 3
    …and finally, if your numbers are floats, and you don't have to worry about NaN etc., GCC can convert this to minss/maxss SSE instructions, which is yet ~25% faster. Morale: Drop the clever bitfiddling tricks and let the compiler do its job. :-) – Steinar H. Gunderson Aug 24 '11 at 19:40
13

While I really like the swap macro provided:

#define min(x, y) (y ^ ((x ^ y) & -(x < y)))
#define max(x, y) (x ^ ((x ^ y) & -(x < y)))
#define SWAP(x,y) { int tmp = min(d[x], d[y]); d[y] = max(d[x], d[y]); d[x] = tmp; }

I see an improvement (which a good compiler might make):

#define SWAP(x,y) { int tmp = ((x ^ y) & -(y < x)); y ^= tmp; x ^= tmp; }

We take note of how min and max work and pull the common sub-expression explicitly. This eliminates the min and max macros completely.
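
As Kevin Stock and Tyler point out in the comments, for the combined macro to be correct it needs to operate on d[x]/d[y] (not on the indices) and flip the inequality; the fixed form would be:

/* corrected per the comments: compare the values and use d[y] < d[x] so that
   d[x] ends up with the min and d[y] with the max */
#define SWAP(x,y) { int tmp = (d[x] ^ d[y]) & -(d[y] < d[x]); d[x] ^= tmp; d[y] ^= tmp; }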

technosaurus
  • 7,676
  • 1
  • 30
  • 52
phkahler
  • 5,687
  • 1
  • 23
  • 31
  • That gets them backwards, notice that d[y] gets the max, which is x^(common subexpression). – Kevin Stock Aug 24 '11 at 15:09
  • I noticed the same thing; I think for your implementation to be correct you want `d[x]` instead of `x` (same for `y`), and `d[y] < d[x]` for the inequality here (yep, different from the min/max code). – Tyler Aug 24 '11 at 15:18
  • I tried with your swap, but local optimization has negative effects at larger level (I guess it introduce dependencies). And the result is slower than the other swap. But as you can see with new solution proposed there was indeed much performance to gain optimizing swap. – kriss Aug 25 '11 at 06:41
12

Never optimize min/max without benchmarking and looking at actual compiler generated assembly. If I let GCC optimize min with conditional move instructions I get a 33% speedup:

#define SWAP(x,y) { int dx = d[x], dy = d[y], tmp; tmp = d[x] = dx < dy ? dx : dy; d[y] ^= dx ^ tmp; }

(280 vs. 420 cycles in the test code). Doing max with ?: is more or less the same, almost lost in the noise, but the above is a little bit faster. This SWAP is faster with both GCC and Clang.

Compilers are also doing an exceptional job at register allocation and alias analysis, effectively moving d[x] into local variables upfront, and only copying back to memory at the end. In fact, they do so even better than if you worked entirely with local variables (like d0 = d[0], d1 = d[1], d2 = d[2], d3 = d[3], d4 = d[4], d5 = d[5]). I'm writing this because you are assuming strong optimization and yet trying to outsmart the compiler on min/max. :)

By the way, I tried Clang and GCC. They do the same optimization, but due to scheduling differences the two show some variation in the results; I can't really say which is faster or slower. GCC is faster on the sorting networks, Clang on the quadratic sorts.

Just for completeness, unrolled bubble sort and insertion sorts are possible too. Here is the bubble sort:

SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4); SWAP(4,5);
SWAP(0,1); SWAP(1,2); SWAP(2,3); SWAP(3,4);
SWAP(0,1); SWAP(1,2); SWAP(2,3);
SWAP(0,1); SWAP(1,2);
SWAP(0,1);

and here is the insertion sort:

//#define ITER(x) { if (t < d[x]) { d[x+1] = d[x]; d[x] = t; } }
//Faster on x86, probably slower on ARM or similar:
#define ITER(x) { d[x+1] ^= t < d[x] ? d[x] ^ d[x+1] : 0; d[x] = t < d[x] ? t : d[x]; }
static inline void sort6_insertion_sort_unrolled_v2(int * d){
    int t;
    t = d[1]; ITER(0);
    t = d[2]; ITER(1); ITER(0);
    t = d[3]; ITER(2); ITER(1); ITER(0);
    t = d[4]; ITER(3); ITER(2); ITER(1); ITER(0);
    t = d[5]; ITER(4); ITER(3); ITER(2); ITER(1); ITER(0);
}

This insertion sort is faster than Daniel Stutzbach's, and is especially good on a GPU or a computer with predication because ITER can be done with only 3 instructions (vs. 4 for SWAP). For example, here is the t = d[2]; ITER(1); ITER(0); line in ARM assembly:

    MOV    r6, r2
    CMP    r6, r1
    MOVLT  r2, r1
    MOVLT  r1, r6
    CMP    r6, r0
    MOVLT  r1, r0
    MOVLT  r0, r6

For six elements the insertion sort is competitive with the sorting network (12 swaps vs. 15 iterations balances 4 instructions/swap vs. 3 instructions/iteration); bubble sort of course is slower. But it's not going to be true when the size grows, since insertion sort is O(n^2) while sorting networks are O(n log n).

Paolo Bonzini
  • 1,900
  • 15
  • 25
  • 1
    More or less related: I submitted [a report](https://gcc.gnu.org/bugzilla/show_bug.cgi?id=67962) to GCC so that it could implement the optimization directly in the compiler. Not sure that it will be done, but at least you can follow how it evolves. – Morwenn Oct 28 '15 at 14:18
11

I ported the test suite to a PPC architecture machine I can not identify (didn't have to touch the code, just increased the iterations of the test, used 8 test cases to avoid polluting the results with mods, and replaced the x86-specific rdtsc):

  • Direct call to qsort library function : 101
  • Naive implementation (insertion sort) : 299
  • Insertion Sort (Daniel Stutzbach) : 108
  • Insertion Sort Unrolled : 51
  • Sorting Networks (Daniel Stutzbach) : 26
  • Sorting Networks (Paul R) : 85
  • Sorting Networks 12 with Fast Swap : 117
  • Sorting Networks 12 reordered Swap : 116
  • Rank Order : 56

jheriko
  • 3,043
  • 1
  • 21
  • 28
  • 1
    Really interesting. It looks like the branchless swap is a bad idea on PPC. It may also be a compiler related effect. Which one was used ? – kriss Aug 24 '11 at 16:03
  • Its a branch of the gcc compiler - the min, max logic is probably not branchless - i will inspect disassembly and let you know, but unless the compiler is clever enough including something like x < y without an if still becomes a branch - on x86/x64 the CMOV instruction might avoid this, but there is no such instruction for fixed point values on PPC, only floats. I might dabble with this tomorrow and let you know - I remember there was a much simpler branchless min/max in the Winamp AVS source, but iirc it was for floats only - but might be a good start towards a truly branchless approach. – jheriko Aug 25 '11 at 00:48
  • 4
    Here is a branchless min/max for PPC with unsigned inputs: `subfc r5,r4,r3; subfe r6,r6,r6; andc r6,r5,r6; add r4,r6,r4; subf r3,r6,r3`. r3/r4 are inputs, r5/r6 are scratch registers, on output r3 gets the min and r4 gets the max. It should be decently schedulable by hand. I found it with the GNU superoptimizer, starting from 4-instructions min and max sequences and looking manually for two that could be combined. For signed inputs, you can of course add 0x80000000 to all elements at the beginning and subtract it again at the end, and then work as if they were unsigned. – Paolo Bonzini Aug 25 '11 at 07:44
7

An XOR swap may be useful in your swapping functions.

void xorSwap (int *x, int *y) {
     if (*x != *y) {
         *x ^= *y;
         *y ^= *x;
         *x ^= *y;
     }
 }

The if may cause too much divergence in your code, but if you have a guarantee that all your ints are unique this could be handy.

naj
  • 61
  • 1
  • 3
  • 1
    xor swap works for equal values as well... x^=y sets x to 0, y^=x leaves y as y (==x), x^=y sets x to y – jheriko Aug 24 '11 at 12:46
  • 11
    When it *doesn't* work is when `x` and `y` point to the same location. – hobbs Aug 24 '11 at 13:48
  • Anyway when used with sorting networks we never call with both x and y pointing to the same location. There is still to find a way to avoid testing which is greater to get the same effect as the branchless swap. I have an idea to achieve that. – kriss Aug 24 '11 at 15:59
5

Looking forward to trying my hand at this and learning from these examples, but first some timings from my 1.5 GHz PPC Powerbook G4 w/ 1 GB DDR RAM. (I borrowed a similar rdtsc-like timer for PPC from http://www.mcs.anl.gov/~kazutomo/rdtsc.html for the timings.) I ran the program a few times and the absolute results varied but the consistently fastest test was "Insertion Sort (Daniel Stutzbach)", with "Insertion Sort Unrolled" a close second.

Here's the last set of times:

**Direct call to qsort library function** : 164
**Naive implementation (insertion sort)** : 138
**Insertion Sort (Daniel Stutzbach)**     : 85
**Insertion Sort Unrolled**               : 97
**Sorting Networks (Daniel Stutzbach)**   : 457
**Sorting Networks (Paul R)**             : 179
**Sorting Networks 12 with Fast Swap**    : 238
**Sorting Networks 12 reordered Swap**    : 236
**Rank Order**                            : 116
Nico

Here is my contribution to this thread: an optimized 1, 4 gap shellsort for a 6-member int vector (valp) containing unique values.

void shellsort (int *valp)
{      
  int c,a,*cp,*ip=valp,*ep=valp+5;

  c=*valp;    a=*(valp+4);if (c>a) {*valp=    a;*(valp+4)=c;}
  c=*(valp+1);a=*(valp+5);if (c>a) {*(valp+1)=a;*(valp+5)=c;}

  cp=ip;    
  do
  {
    c=*cp;
    a=*(cp+1);
    do
    {
      if (c<a) break;

      *cp=a;
      *(cp+1)=c;
      cp-=1;
      c=*cp;
    } while (cp>=valp);
    ip+=1;
    cp=ip;
  } while (ip<ep);
}

On my HP dv7-3010so laptop with a dual-core Athlon M300 @ 2 GHz (DDR2 memory) it executes in 165 clock cycles. This is an average calculated from timing every unique sequence (6! = 720 in all). Compiled to Win32 using OpenWatcom 1.8. The loop is essentially an insertion sort and is 16 instructions/37 bytes long.

I do not have a 64-bit environment to compile on.

Olof Forshell

I know I'm super-late, but I was interested in experimenting with some different solutions. First, I cleaned up that paste, made it compile, and put it into a repository. I kept some undesirable solutions as dead ends so that others wouldn't retry them. Among these was my first solution, which attempted to ensure that x1>x2 was calculated only once. After optimization, it is no faster than the other, simple versions.

I added a looping version of rank order sort, since my own application of this study is sorting 2-8 items, and with a variable number of arguments a loop is necessary. This is also why I ignored the sorting network solutions.

The test code didn't test that duplicates were handled correctly, so while the existing solutions were all correct, I added a special case to the test code to ensure that duplicates were handled correctly.
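
The looping version itself lives in the repository linked at the end of this answer; as a rough stand-in for the idea (my sketch, not the repo's code), a loop-based rank-order sort with a stable tie-break so duplicates land in distinct slots can look like this:

/* Rank-order sort for a small array: compute each element's final position by
 * counting how many elements must precede it, then scatter into a scratch array.
 * The j < i term breaks ties between equal values so every rank is distinct. */
static void rank_order_sort_loop(int *d, int n) {
    int out[8];   /* n is assumed to be at most 8 here */
    for (int i = 0; i < n; i++) {
        int rank = 0;
        for (int j = 0; j < n; j++)
            rank += (d[j] < d[i]) || (d[j] == d[i] && j < i);
        out[rank] = d[i];
    }
    for (int i = 0; i < n; i++) d[i] = out[i];
}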

Then, I wrote an insertion sort that is entirely in AVX registers. On my machine it is 25% faster than the other insertion sorts, but 100% slower than rank order. I did this purely for experiment and had no expectation of this being better due to the branching in insertion sort.

static inline void sort6_insertion_sort_avx(int* d) {
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], 0, 0);
    __m256i index = _mm256_setr_epi32(0, 1, 2, 3, 4, 5, 6, 7);
    __m256i shlpermute = _mm256_setr_epi32(7, 0, 1, 2, 3, 4, 5, 6);
    __m256i sorted = _mm256_setr_epi32(d[0], INT_MAX, INT_MAX, INT_MAX,
            INT_MAX, INT_MAX, INT_MAX, INT_MAX);
    __m256i val, gt, permute;
    unsigned j;
     // _mm256_movemask_epi8 gives 4 mask bits per 32-bit lane, so ffs(...) >> 2 yields the lane index
#define ITER(I) \
        val = _mm256_permutevar8x32_epi32(src, _mm256_set1_epi32(I));\
        gt =  _mm256_cmpgt_epi32(sorted, val);\
        permute =  _mm256_blendv_epi8(index, shlpermute, gt);\
        j = ffs( _mm256_movemask_epi8(gt)) >> 2;\
        sorted = _mm256_blendv_epi8(_mm256_permutevar8x32_epi32(sorted, permute),\
                val, _mm256_cmpeq_epi32(index, _mm256_set1_epi32(j)))
    ITER(1);
    ITER(2);
    ITER(3);
    ITER(4);
    ITER(5);
    int x[8];
    _mm256_storeu_si256((__m256i*)x, sorted);
    d[0] = x[0]; d[1] = x[1]; d[2] = x[2]; d[3] = x[3]; d[4] = x[4]; d[5] = x[5];
#undef ITER
}

Then, I wrote a rank order sort using AVX. This matches the speed of the other rank-order solutions, but is no faster. The issue here is that I can only calculate the indices with AVX, and then I have to make a table of indices. This is because the calculation is destination-based rather than source-based. See Converting from Source-based Indices to Destination-based Indices
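
As a scalar illustration of that conversion (my sketch, not code from the answer): a rank-order computation naturally yields destination indices (where each element should go), while a vector permute such as vpermd wants source indices (which element lands in each slot); one is simply the inverse permutation of the other.

/* Given dst[i] = final position of element i, build src[j] = index of the element
 * that ends up at position j, so that out[j] = in[src[j]] becomes a gather/permute. */
static void invert_permutation6(const int dst[6], int src[6]) {
    for (int i = 0; i < 6; i++)
        src[dst[i]] = i;
}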

static inline void sort6_rank_order_avx(int* d) {
    __m256i ror = _mm256_setr_epi32(5, 0, 1, 2, 3, 4, 6, 7);
    __m256i one = _mm256_set1_epi32(1);
    __m256i src = _mm256_setr_epi32(d[0], d[1], d[2], d[3], d[4], d[5], INT_MAX, INT_MAX);
    __m256i rot = src;
    __m256i index = _mm256_setzero_si256();
    __m256i gt, permute;
    __m256i shl = _mm256_setr_epi32(1, 2, 3, 4, 5, 6, 6, 6);
    __m256i dstIx = _mm256_setr_epi32(0,1,2,3,4,5,6,7);
    __m256i srcIx = dstIx;
    __m256i eq = one;
    __m256i rotIx = _mm256_setzero_si256();
#define INC(I)\
    rot = _mm256_permutevar8x32_epi32(rot, ror);\
    gt = _mm256_cmpgt_epi32(src, rot);\
    index = _mm256_add_epi32(index, _mm256_and_si256(gt, one));\
    index = _mm256_add_epi32(index, _mm256_and_si256(eq,\
                _mm256_cmpeq_epi32(src, rot)));\
    eq = _mm256_insert_epi32(eq, 0, I)
    INC(0);
    INC(1);
    INC(2);
    INC(3);
    INC(4);
    int e[6];
    e[0] = d[0]; e[1] = d[1]; e[2] = d[2]; e[3] = d[3]; e[4] = d[4]; e[5] = d[5];
    int i[8];
    _mm256_storeu_si256((__m256i*)i, index);
    d[i[0]] = e[0]; d[i[1]] = e[1]; d[i[2]] = e[2]; d[i[3]] = e[3]; d[i[4]] = e[4]; d[i[5]] = e[5];
}

The repo can be found here: https://github.com/eyepatchParrot/sort6/

eyepatch
    You can use `vmovmskps` on integer vectors (with a cast to keep the intrinsics happy), avoiding the need to right-shift the bitscan (`ffs`) result. – Peter Cordes Jun 09 '19 at 14:48
    You can conditionally add 1 based on a `cmpgt` result by *subtracting* it, instead of masking it with `set1(1)`. e.g. `index = _mm256_sub_epi32(index, gt)` does `index -= -1 or 0;` – Peter Cordes Jun 09 '19 at 14:49
    `eq = _mm256_insert_epi32(eq, 0, I)` is not an efficient way to zero an element if it compiles as written (especially for elements outside the low 4, because `vpinsrd` is only available with an XMM destination; indices higher than 3 have to be emulated). Instead, `_mm256_blend_epi32` (`vpblendd`) with a zeroed vector. `vpblendd` is a single-uop instruction that runs on any port, vs. a shuffle that needs port 5 on Intel CPUs. (https://agner.org/optimize/). – Peter Cordes Jun 09 '19 at 14:54
    Also, you might consider generating the `rot` vectors with different shuffles from the same source, or at least run 2 dep chains in parallel that you use alternately, instead of one single dep chain through a lane-crossing shuffle (3 cycle latency). That will increase ILP within a single sort. 2 dep chains limits the number of vector constants to a reasonable number, just 2: 1 for one rotate, and one for 2 rotate steps combined. – Peter Cordes Jun 09 '19 at 14:57

If insertion sort is reasonably competitive here, I would recommend trying a shellsort. I'm afraid 6 elements is probably just too little for it to be among the best, but it might be worth a try.

Example code, untested, undebugged, etc. You want to tune the inc = 4 and inc -= 3 sequence to find the optimum (try inc = 2, inc -= 1 for example).

static __inline__ void sort6(int * d) {
    char j, i;
    int inc, tmp;
    for (inc = 4; inc > 0; inc -= 3) {
        for (i = inc; i < 6; i++) {
            tmp = d[i];
            j = i;
            while (j >= inc && d[j - inc] > tmp) {
                d[j] = d[j - inc];
                j -= inc;
            }
            d[j] = tmp;
        }
    }
}

I don't think this will win, but if someone posts a question about sorting 10 elements, who knows...

According to Wikipedia this can even be combined with sorting networks: Pratt, V (1979). Shellsort and sorting networks (Outstanding dissertations in the computer sciences). Garland. ISBN 0-824-04406-1

gcp

This question is becoming quite old, but I actually had to solve the same problem these days: fast algorithms to sort small arrays. I thought it would be a good idea to share my knowledge. While I first started by using sorting networks, I finally managed to find other algorithms for which the total number of comparisons performed to sort every permutation of 6 values was smaller than with sorting networks, and smaller than with insertion sort. I didn't count the number of swaps; I would expect it to be roughly equivalent (maybe a bit higher sometimes).

The algorithm sort6 uses the algorithm sort4 which uses the algorithm sort3. Here is the implementation in some light C++ form (the original is template-heavy so that it can work with any random-access iterator and any suitable comparison function).

Sorting 3 values

The following algorithm is an unrolled insertion sort. When two swaps (6 assignments) have to be performed, it uses 4 assignments instead:

void sort3(int* array)
{
    if (array[1] < array[0]) {
        if (array[2] < array[0]) {
            if (array[2] < array[1]) {
                std::swap(array[0], array[2]);
            } else {
                int tmp = array[0];
                array[0] = array[1];
                array[1] = array[2];
                array[2] = tmp;
            }
        } else {
            std::swap(array[0], array[1]);
        }
    } else {
        if (array[2] < array[1]) {
            if (array[2] < array[0]) {
                int tmp = array[2];
                array[2] = array[1];
                array[1] = array[0];
                array[0] = tmp;
            } else {
                std::swap(array[1], array[2]);
            }
        }
    }
}

It looks a bit complex because the sort has more or less one branch for every possible permutation of the array, using 2~3 comparisons and at most 4 assignments to sort the three values.

Sorting 4 values

This one calls sort3 then performs an unrolled insertion sort with the last element of the array:

void sort4(int* array)
{
    // Sort the first 3 elements
    sort3(array);

    // Insert the 4th element with insertion sort 
    if (array[3] < array[2]) {
        std::swap(array[2], array[3]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[1] < array[0]) {
                std::swap(array[0], array[1]);
            }
        }
    }
}

This algorithm performs 3 to 6 comparisons and at most 5 swaps. It is easy to unroll an insertion sort, but we will be using another algorithm for the last sort...

Sorting 6 values

This one uses an unrolled version of what I called a double insertion sort. The name isn't that great, but it's quite descriptive, here is how it works:

  • Sort everything but the first and the last elements of the array.
  • Swap the first and the last elements of the array if the first is greater than the last.
  • Insert the first element into the sorted sequence from the front then the last element from the back.

After the swap, the first element is always smaller than the last, which means that, when inserting them into the sorted sequence, there won't be more than N comparisons to insert the two elements in the worst case: for example, if the first element has been inserted in the 3rd position, then the last one can't be inserted lower than the 4th position.

void sort6(int* array)
{
    // Sort everything but first and last elements
    sort4(array+1);

    // Switch first and last elements if needed
    if (array[5] < array[0]) {
        std::swap(array[0], array[5]);
    }

    // Insert first element from the front
    if (array[1] < array[0]) {
        std::swap(array[0], array[1]);
        if (array[2] < array[1]) {
            std::swap(array[1], array[2]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[4] < array[3]) {
                    std::swap(array[3], array[4]);
                }
            }
        }
    }

    // Insert last element from the back
    if (array[5] < array[4]) {
        std::swap(array[4], array[5]);
        if (array[4] < array[3]) {
            std::swap(array[3], array[4]);
            if (array[3] < array[2]) {
                std::swap(array[2], array[3]);
                if (array[2] < array[1]) {
                    std::swap(array[1], array[2]);
                }
            }
        }
    }
}

My tests on every permutation of 6 values show that this algorithm always performs between 6 and 13 comparisons. I didn't compute the number of swaps performed, but I don't expect it to be higher than 11 in the worst case.

I hope that this helps, even if this question may not represent an actual problem anymore :)

EDIT: after putting it in the provided benchmark, it is clearly slower than most of the interesting alternatives. It tends to perform a bit better than the unrolled insertion sort, but that's pretty much it. Basically, it isn't the best sort for integers but could be interesting for types with an expensive comparison operation.

Morwenn
  • These are nice. As the problem being solved is many decades old, probably as old as C programming itself, the fact that the question is now nearly 5 years old doesn't matter that much. – kriss Oct 07 '15 at 12:24
  • You should have a look at the way the other answers are timed. The point is that with such a small dataset, counting comparisons or even comparisons and swaps doesn't really say how fast an algorithm is (basically sorting 6 ints is always O(1) because O(6*6) is O(1)). The current fastest of the previously proposed solutions immediately finds the position of each value using a big comparison (by RexKerr). – kriss Oct 07 '15 at 12:30
  • @kriss Is it the fastest now? From my reading of the results, the sorting networks approach was the fastest, my bad. It's also true that my solution comes from my generic library and that I'm not always comparing integers, nor always using `operator<` for the comparison. Besides the objective count of comparisons and swaps, I also properly timed my algorithms; this solution was the fastest generic one, but I indeed missed @RexKerr's one. Gonna try it :) – Morwenn Oct 07 '15 at 12:41
  • The solution by RexKerr (Rank Order) became the fastest on the x86 architecture since gcc 4.2.3 (and as of gcc 4.9 became nearly two times faster than the second best). But it's heavily dependent on compiler optimisations and may not hold on other architectures. – kriss Oct 07 '15 at 14:23
  • @kriss That's interesting to know. And I could indeed see more differences again with `-O3`. I guess I will adopt another strategy for my sorting library then: providing three kinds of algorithms to have either a low number of comparisons, a low number of swaps or potentially the best performance. At least, what happens will be transparent for the reader. Thanks for your insights :) – Morwenn Oct 07 '15 at 14:26

I found that at least on my system, the functions sort6_iterator() and sort6_iterator_local() defined below both ran at least as fast, and frequently noticeably faster, than the above current record holder:

#define MIN(x, y) ((x)<(y)?(x):(y))
#define MAX(x, y) ((x)<(y)?(y):(x))

template<class IterType> 
inline void sort6_iterator(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(*(it + x), *(it + y)); \
  const auto b = MAX(*(it + x), *(it + y)); \
  *(it + x) = a; *(it + y) = b; }
  
  SWAP(1, 2) SWAP(4, 5)
  SWAP(0, 2) SWAP(3, 5)
  SWAP(0, 1) SWAP(3, 4)
  SWAP(1, 4) SWAP(0, 3)
  SWAP(2, 5) SWAP(1, 3)
  SWAP(2, 4)
  SWAP(2, 3)
#undef SWAP
}

I passed this function a std::vector's iterator in my timing code.

That compilers can more aggressively optimize template functions might be one reason for the speed. (I also suspect that using iterators gives g++ certain assurances (that it otherwise wouldn't have) about the memory that the iterator refers to, which facilitates optimization.) IIRC this is also part of the reason why so many STL algorithms, such as std::sort(), generally have such obscenely good performance.

Compilers can do this because template functions are NOT functions. C++ inherited many rules from C that limit what compilers can do with code, like:

  1. limits on rearranging/changing/removing code.
  2. limits on what assumptions the compiler can make.

The C++ committee decided that templates should follow different rules that are more amenable to optimization.

For example: in void sort2(int* p, int* q){/*...*/}, the compiler can't assume that p and q point to non-overlapping memory, that p != q (since sort2(d, d) might be called somewhere), or even that p and q are not null. If it's inlined then it might figure out that p != q in a call like sort2(d, d+1). Compilers also have certain guarantees about references (ex: they're never null, unlike pointers), so they're more amenable to optimization than pointers.

Also, IIRC, compilers MUST store functions like sort2 in memory in case another source file tries calling sort2 via a function pointer. This isn't required of function templates like template<class T> void sort2(T* p, T* q){/*...*/}, which are not required to be accessible via function pointers. There are also limits to what functions can be inlined that don't apply to function templates (all implementations of the STL algorithms that I've seen have a plethora of method/function (template) calls, but they still run super fast since most of them really are inlined away by the compiler).

It's important to mention that what code is fastest probably depends a lot on the optimizer. Since different code might be optimized differently, this may explain why variations of sort6 (ex: using different sorting networks, or defining MAX/MIN/SWAP differently) may have very different run-times.

For example, sort6_iterator() is sometimes (again, depending on the context in which the function is called) consistently outperformed by the following sorting function, which copies the data into local variables before sorting them.1 Since there are only 6 local variables defined, if these local variables are primitives then they are likely never actually stored in RAM and are instead only ever stored in the CPU's registers until the end of the function call, which helps make this sorting function fast. (It also helps that the compiler knows that distinct local variables have distinct locations in memory).

template<class IterType> 
inline void sort6_iterator_local(IterType it) 
{
#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  const auto b = MAX(data##x, data##y); \
  data##x = a; data##y = b; }
//DD = Define Data
#define DD1(a)   auto data##a = *(it + a);
#define DD2(a,b) auto data##a = *(it + a), data##b = *(it + b);
//CB = Copy Back
#define CB(a) *(it + a) = data##a;
  
  DD2(1,2)    SWAP(1, 2)
  DD2(4,5)    SWAP(4, 5)
  DD1(0)      SWAP(0, 2)
  DD1(3)      SWAP(3, 5)
  SWAP(0, 1)  SWAP(3, 4)
  SWAP(1, 4)  SWAP(0, 3)   CB(0)
  SWAP(2, 5)  CB(5)
  SWAP(1, 3)  CB(1)
  SWAP(2, 4)  CB(4)
  SWAP(2, 3)  CB(2)        CB(3)
#undef CB
#undef DD2
#undef DD1
#undef SWAP
}

Note that defining SWAP() as follows sometimes results in slightly better performance although most of the time it results in slightly worse performance or a negligible difference in performance.

#define SWAP(x,y) { const auto a = MIN(data##x, data##y); \
  data##y = MAX(data##x, data##y); \
  data##x = a; }

If you just want a sorting algorithm that gcc -O3 is consistently good at optimizing on primitive data types, no matter what context the call to the sorting function appears in1, then, depending on how you pass the input, try one of the following two algorithms:

template<class T> inline void sort6(T it) {
#define SORT2(x,y) {if(data##x>data##y){auto a=std::move(data##y);data##y=std::move(data##x);data##x=std::move(a);}}
#define DD1(a)   register auto data##a=*(it+a);
#define DD2(a,b) register auto data##a=*(it+a);register auto data##b=*(it+b);
#define CB1(a)   *(it+a)=data##a;
#define CB2(a,b) *(it+a)=data##a;*(it+b)=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

Or if you want to pass the variables by reference then use this (the below function differs from the above in its first 5 lines):

template<class T> inline void sort6(T& e0, T& e1, T& e2, T& e3, T& e4, T& e5) {
#define SORT2(x,y) {if(data##x>data##y)std::swap(data##x,data##y);}
#define DD1(a)   register auto data##a=e##a;
#define DD2(a,b) register auto data##a=e##a;register auto data##b=e##b;
#define CB1(a)   e##a=data##a;
#define CB2(a,b) e##a=data##a;e##b=data##b;
  DD2(1,2) SORT2(1,2)
  DD2(4,5) SORT2(4,5)
  DD1(0)   SORT2(0,2)
  DD1(3)   SORT2(3,5)
  SORT2(0,1) SORT2(3,4) SORT2(2,5) CB1(5)
  SORT2(1,4) SORT2(0,3) CB1(0)
  SORT2(2,4) CB1(4)
  SORT2(1,3) CB1(1)
  SORT2(2,3) CB2(2,3)
#undef CB1
#undef CB2
#undef DD1
#undef DD2
#undef SORT2
}

The reason for using the register keyword is that this is one of the few times that you know that you want these int values in registers (if that's possible). Without register, the compiler will figure this out most of the time, but sometimes it doesn't. Using the register keyword helps solve this issue. Normally, however, don't use the register keyword since it's more likely to slow your code down than speed it up (ex: using all 6 registers for sort6 means they can't be used for something else, which might result in overall slower code).

  1. While timing various sorting functions I noticed that the context (i.e. surrounding code) in which the call to the sorting function was made had a significant impact on performance, which is likely due to the function being inlined and then optimized. For instance, if the program was sufficiently simple then there usually wasn't much of a difference in performance between passing the sorting function a pointer versus passing it an iterator; otherwise using iterators usually resulted in noticeably better performance and never (in my experience so far at least) any noticeably worse performance. I suspect that this may be because g++ can globally optimize sufficiently simple code.
Matthew K.

I know this is an old question.

But I just wrote a different kind of solution I want to share.
Using nothing but nested MIN MAX,

It's not fast, as it uses 114 of each,
but it could be reduced to 75 pretty simply like so -> pastebin

But then it's not purely min max anymore.

What might work is doing min/max on multiple integers at once with AVX

PMINSW reference

#include <stdio.h>

static __inline__ int MIN(int a, int b){
int result =a;
__asm__ ("pminsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ int MAX(int a, int b){
int result = a;
__asm__ ("pmaxsw %1, %0" : "+x" (result) : "x" (b));
return result;
}
static __inline__ unsigned long long rdtsc(void){
  unsigned long long int x;
__asm__ volatile (".byte 0x0f, 0x31" :
  "=A" (x));
  return x;
}

#define MIN3(a, b, c) (MIN(MIN(a,b),c))
#define MIN4(a, b, c, d) (MIN(MIN(a,b),MIN(c,d)))

static __inline__ void sort6(int * in) {
  const int A=in[0], B=in[1], C=in[2], D=in[3], E=in[4], F=in[5];

  in[0] = MIN( MIN4(A,B,C,D),MIN(E,F) );

  const int
  AB = MAX(A, B),
  AC = MAX(A, C),
  AD = MAX(A, D),
  AE = MAX(A, E),
  AF = MAX(A, F),
  BC = MAX(B, C),
  BD = MAX(B, D),
  BE = MAX(B, E),
  BF = MAX(B, F),
  CD = MAX(C, D),
  CE = MAX(C, E),
  CF = MAX(C, F),
  DE = MAX(D, E),
  DF = MAX(D, F),
  EF = MAX(E, F);

  in[1] = MIN4 (
  MIN4( AB, AC, AD, AE ),
  MIN4( AF, BC, BD, BE ),
  MIN4( BF, CD, CE, CF ),
  MIN3( DE, DF, EF)
  );

  const int
  ABC = MAX(AB,C),
  ABD = MAX(AB,D),
  ABE = MAX(AB,E),
  ABF = MAX(AB,F),
  ACD = MAX(AC,D),
  ACE = MAX(AC,E),
  ACF = MAX(AC,F),
  ADE = MAX(AD,E),
  ADF = MAX(AD,F),
  AEF = MAX(AE,F),
  BCD = MAX(BC,D),
  BCE = MAX(BC,E),
  BCF = MAX(BC,F),
  BDE = MAX(BD,E),
  BDF = MAX(BD,F),
  BEF = MAX(BE,F),
  CDE = MAX(CD,E),
  CDF = MAX(CD,F),
  CEF = MAX(CE,F),
  DEF = MAX(DE,F);

  in[2] = MIN( MIN4 (
  MIN4( ABC, ABD, ABE, ABF ),
  MIN4( ACD, ACE, ACF, ADE ),
  MIN4( ADF, AEF, BCD, BCE ),
  MIN4( BCF, BDE, BDF, BEF )),
  MIN4( CDE, CDF, CEF, DEF )
  );


  const int
  ABCD = MAX(ABC,D),
  ABCE = MAX(ABC,E),
  ABCF = MAX(ABC,F),
  ABDE = MAX(ABD,E),
  ABDF = MAX(ABD,F),
  ABEF = MAX(ABE,F),
  ACDE = MAX(ACD,E),
  ACDF = MAX(ACD,F),
  ACEF = MAX(ACE,F),
  ADEF = MAX(ADE,F),
  BCDE = MAX(BCD,E),
  BCDF = MAX(BCD,F),
  BCEF = MAX(BCE,F),
  BDEF = MAX(BDE,F),
  CDEF = MAX(CDE,F);

  in[3] = MIN4 (
  MIN4( ABCD, ABCE, ABCF, ABDE ),
  MIN4( ABDF, ABEF, ACDE, ACDF ),
  MIN4( ACEF, ADEF, BCDE, BCDF ),
  MIN3( BCEF, BDEF, CDEF )
  );

  const int
  ABCDE= MAX(ABCD,E),
  ABCDF= MAX(ABCD,F),
  ABCEF= MAX(ABCE,F),
  ABDEF= MAX(ABDE,F),
  ACDEF= MAX(ACDE,F),
  BCDEF= MAX(BCDE,F);

  in[4]= MIN (
  MIN4( ABCDE, ABCDF, ABCEF, ABDEF ),
  MIN ( ACDEF, BCDEF )
  );

  in[5] = MAX(ABCDE,F);
}

int main(int argc, char ** argv) {
  int d[6][6] = {
    {1, 2, 3, 4, 5, 6},
    {6, 5, 4, 3, 2, 1},
    {100, 2, 300, 4, 500, 6},
    {100, 2, 3, 4, 500, 6},
    {1, 200, 3, 4, 5, 600},
    {1, 1, 2, 1, 2, 1}
  };

  unsigned long long cycles = rdtsc();
  for (int i = 0; i < 6; i++) {
    sort6(d[i]);
  }
  cycles = rdtsc() - cycles;
  printf("Time is %d\n", (unsigned)cycles);

  for (int i = 0; i < 6; i++) {
    printf("d%d : %d %d %d %d %d %d\n", i,
     d[i][0], d[i][1], d[i][2],
     d[i][3], d[i][4], d[i][5]);
  }
}

EDIT:
Rank order solution inspired by Rex Kerr's, much faster than the mess above

static void sort6(int *o) {
const int 
A=o[0],B=o[1],C=o[2],D=o[3],E=o[4],F=o[5];
const unsigned char
AB = A>B, AC = A>C, AD = A>D, AE = A>E,
          BC = B>C, BD = B>D, BE = B>E,
                    CD = C>D, CE = C>E,
                              DE = D>E,
a =          AB + AC + AD + AE + (A>F),
b = 1 - AB      + BC + BD + BE + (B>F),
c = 2 - AC - BC      + CD + CE + (C>F),
d = 3 - AD - BD - CD      + DE + (D>F),
e = 4 - AE - BE - CE - DE      + (E>F);
o[a]=A; o[b]=B; o[c]=C; o[d]=D; o[e]=E;
o[15-a-b-c-d-e]=F;
}
PrincePolka
    always nice to see new solutions. It looks like some easy optimisations are possible. In the end it may not prove so different from Sorting Networks. – kriss Sep 28 '17 at 20:49
  • Yes, the number of MIN and MAX calls could possibly be reduced, for example MIN(AB, CD) repeats a few times, but reducing them a lot will be hard I think. I added your test cases. – PrincePolka Sep 28 '17 at 20:53
  • pmin/maxsw operate on packed 16-bit signed integers (`int16_t`). But your C function claims it sorts an array of `int` (which is 32-bit in all C implementations that support that `asm` syntax). Did you test it with only small positive integers that have only 0 in their high halves? That will work... For `int` you need SSE4.1 `pmin/maxsd` (d = dword). https://www.felixcloutier.com/x86/pminsd:pminsq or `pminusd` for `uint32_t`. – Peter Cordes Jun 09 '19 at 15:03

I thought I'd try an unrolled Ford-Johnson merge-insertion sort, which achieves the minimum possible number of comparisons (ceil(log2(6!)) = 10) and no swaps. It doesn't compete, though (I got a slightly better timing than the worst sorting networks solution sort6_sorting_network_v1).

It loads the values into six registers, then performs 8 to 10 comparisons to decide which of the 720=6! cases it's in, then writes the registers back in the appropriate one of those 720 orders (separate code for each case). There are no swaps or reordering of anything until the final write-back. I haven't looked at the generated assembly code.

static inline void sort6_ford_johnson_unrolled(int *D) {
  register int a = D[0], b = D[1], c = D[2], d = D[3], e = D[4], f = D[5];
  #define abcdef(a,b,c,d,e,f) (D[0]=a, D[1]=b, D[2]=c, D[3]=d, D[4]=e, D[5]=f)
  #define abdef_cd(a,b,c,d,e,f) (c<a ? abcdef(c,a,b,d,e,f) \
                                     : c<b ? abcdef(a,c,b,d,e,f) \
                                           : abcdef(a,b,c,d,e,f))
  #define abedf_cd(a,b,c,d,e,f) (c<b ? c<a ? abcdef(c,a,b,e,d,f) \
                                           : abcdef(a,c,b,e,d,f) \
                                     : c<e ? abcdef(a,b,c,e,d,f) \
                                           : abcdef(a,b,e,c,d,f))
  #define abdf_cd_ef(a,b,c,d,e,f) (e<b ? e<a ? abedf_cd(e,a,c,d,b,f) \
                                             : abedf_cd(a,e,c,d,b,f) \
                                       : e<d ? abedf_cd(a,b,c,d,e,f) \
                                             : abdef_cd(a,b,c,d,e,f))
  #define abd_cd_ef(a,b,c,d,e,f) (d<f ? abdf_cd_ef(a,b,c,d,e,f) \
                                      : b<f ? abdf_cd_ef(a,b,e,f,c,d) \
                                            : abdf_cd_ef(e,f,a,b,c,d))
  #define ab_cd_ef(a,b,c,d,e,f) (b<d ? abd_cd_ef(a,b,c,d,e,f) \
                                     : abd_cd_ef(c,d,a,b,e,f))
  #define ab_cd(a,b,c,d,e,f) (e<f ? ab_cd_ef(a,b,c,d,e,f) \
                                  : ab_cd_ef(a,b,c,d,f,e))
  #define ab(a,b,c,d,e,f) (c<d ? ab_cd(a,b,c,d,e,f) \
                               : ab_cd(a,b,d,c,e,f))
  a<b ? ab(a,b,c,d,e,f)
      : ab(b,a,c,d,e,f);
  #undef ab
  #undef ab_cd
  #undef ab_cd_ef
  #undef abd_cd_ef
  #undef abdf_cd_ef
  #undef abedf_cd
  #undef abdef_cd
  #undef abcdef
}

TEST(ford_johnson_unrolled,   "Unrolled Ford-Johnson Merge-Insertion sort");
Don Hatch
  • The idea of making the minimal number of comparisons and using that to pick the right variable ordering is also the basis for Rank Order. It looks like while avoiding the swap is nice, having 10 branches and 720 code paths is not cheap. – kriss Nov 24 '21 at 00:59
  • @kriss It looks somewhat similar, however, I don't think the Rank Order based solutions do a minimal number of comparisons, do they? Look like one of them does 25 comparisons, another does 15. Also the assignment at the end of Rank Order goes through indirection. Rank order wins anyway, of course, but I wonder if my method here will win on future machines with tons more instruction cache or other resources. – Don Hatch Nov 24 '21 at 01:18
  • branches, when implemented as jumps, are likely the most costly CPU feature because they empty all caches and speculative execution pipelines. I don't see any evolution that would ever make that cheap, especially with 720 unique code paths. A single test can be cheap because it can be implemented branchless as a conditional assignment. The core idea of rank order is to perform tests but without actually branching. The trouble here is likely the follow-up of each minimal test by a conditional branch. But I don't see how that can be avoided while keeping comparisons minimal. – kriss Dec 21 '21 at 00:17
  • @kriss the "future machine" scenario I'm thinking of is just this: https://en.wikipedia.org/wiki/Speculative_execution#Eager_execution . "With unlimited resources, eager execution ... would in theory provide the same performance as perfect branch prediction". – Don Hatch Dec 21 '21 at 04:31
  • I understand, but I don't believe in the actual feasibility of it, at least at the hardware level. Even branch prediction is not efficient today when prediction fails. Of course we can imagine running 720 processors on the same code and only one of them keeping the result, but to spend so many resources we would have to imagine a use case where any minor speed improvement is more important than the resources used. And also that selecting the right result has a really small cost. – kriss Dec 21 '21 at 14:23
  • Also if we go down the unlimited-resources path, we can imagine all orderings being tested at once; the total number of comparisons in the source code doesn't matter. I can imagine 720 parallel code paths each performing 5 comparisons (to check that the result is ordered). It's still a waste of resources but it's simple at the hardware level (no combinatorial explosion, code dedicated to solving this specific problem). – kriss Dec 21 '21 at 14:27

I know the party ended 12 years ago, but still.

I've found that the best way to sort a handful of items is the parallel iteration of

shifted = [0 sorted](1 : length(sorted));
shifted = max(shifted, new_value);
sorted = min(sorted, shifted);

Starting with sorted = ones(1, N) * max_value;

This is going to SIMDify quite nicely even with

using vec4 = int32_t __attribute__((vector_size(16)));

// min()/max() stand for element-wise vector min/max helpers; data[] is the 6 input values
int32_t maxval = 0x7fffffff;

vec4 sorted_lo = { maxval, maxval, maxval, maxval }, sorted_hi = sorted_lo;
sorted_lo[0] = data[0];  // "pre-sort" the first element directly

for (int i = 1 ; i < 6; i++) {
 vec4 new_value = { data[i], data[i], data[i], data[i] };
 vec4 shifted_lo = { 0, sorted_lo[0], sorted_lo[1], sorted_lo[2] };
 vec4 shifted_hi = { sorted_lo[3], sorted_hi[0], sorted_hi[1], sorted_hi[2] };

 shifted_lo = max(shifted_lo, new_value);
 shifted_hi = max(shifted_hi, new_value);

 sorted_lo = min(sorted_lo, shifted_lo);
 sorted_hi = min(sorted_hi, shifted_hi);
}

Many SIMD architectures (SSE4, ARM64) contain the min/max operations, as well as the shifting between two vectors, and duplication of a single lane. In AVX2 one can not efficiently shift between the lanes, but one can sort two independent vectors of 5-8 elements.
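
For instance, one step of the shift/max/min iteration above could look like this with SSE4.1 intrinsics (a sketch under the same assumption of non-negative 32-bit values, since a zero is shifted into lane 0; it is not taken from the answer):

#include <smmintrin.h>   /* SSE4.1: _mm_min_epi32 / _mm_max_epi32 */

/* sorted holds ascending values padded with INT_MAX; one call inserts value. */
static inline __m128i insert_step(__m128i sorted, int value) {
    __m128i v       = _mm_set1_epi32(value);      /* broadcast the new value            */
    __m128i shifted = _mm_slli_si128(sorted, 4);  /* lane i <- lane i-1, zero into lane 0 */
    shifted         = _mm_max_epi32(shifted, v);  /* shifted = max(shifted, new_value)   */
    return _mm_min_epi32(sorted, shifted);        /* sorted  = min(sorted, shifted)      */
}

For six elements, two such registers are chained exactly as sorted_lo/sorted_hi are in the loop above.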

A greatly improved version of this will use bitonic merge sort with two vectors. The first vector uses the insertion sort trick for the first 3 values, e.g. from 1 8 4 3 5 4 giving 1 4 8 inf, while the other vector holds the remaining values sorted in descending order as inf 5 4 3.

A bitonic merge stage will then sort A = min(a,b); B = max(a,b), then sort the A[0]<->A[2], A[1]<->A[3], B[0]<->B[2], B[1]<->B[3] in parallel and finally the last stage A[0]<->A[1], A[2]<->A[3],B[0]<->B[1], B[2]<->B[3].

With the example values, this would mean

  A = 1 4 8 inf, B = inf 5 4 3
      <--------------->
        <---------------->
          <---------------->
             <--------------->
  A = 1 4 4 3  , B = inf 5 8 inf
      <--->          <----->
        <--->            <---->
  A = 1 3 4 4  , B = 8   5 inf inf
      <-> <->        <---> <---->
  A = 1 3 4 5  , B = 5   8 inf inf

Probably the most difficult step there is A[0,2], A[1,3], B[0,2], B[1,3] stage, requiring some shuffling. In contrast at least ARM64 NEON has an instruction to extract the pairwise minimum/maximum.

To answer the comment, it's generally not an issue to transfer data to vector registers. There might be longer latency when retrieving the data either from SIMD registers or from memory. For SSE2, I would e.g. try to place the 6 elements in a vector of

int32_t x[8] = {inf, 0,0,0,0,0,0, inf};

which would allow fast shuffling (pshufd / _mm_shuffle_epi32) to get the first sorted vectors of val[1], inf, inf, inf vs inf, inf, inf, val[4], and also broadcasting any lane over an XMM register.

Aki Suihkonen
  • sorting never goes out of fashion. :-) Of course it should be measured anyway, because for a small number of items loading the data into the vector registers may cost more than the speedup. – kriss May 25 '23 at 19:03

I believe there are two parts to your question.

  • The first is to determine the optimal algorithm. This is done - at least in this case - by looping through every possible ordering (there aren't that many), which allows you to compute exact min, max, average and standard deviation of compares and swaps (a brute-force harness of this idea is sketched right after this list). Have a runner-up or two handy as well.
  • The second is to optimize the algorithm. A lot can be done to convert textbook code examples to mean and lean real-life algorithms. If you realize that an algorithm can't be optimized to the extent required, try a runner-up.
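
As a sketch of that first step (my illustration, instrumenting the naive insertion sort rather than any particular contender), a brute-force harness over the 720 permutations can look like this:

#include <stdio.h>

static long cmp_count;
static int less_than(int a, int b) { cmp_count++; return a < b; }

/* candidate under test: a plain insertion sort, instrumented to count compares */
static void sort6_counted(int *d) {
    for (int i = 1; i < 6; i++) {
        int tmp = d[i], j = i;
        while (j > 0 && less_than(tmp, d[j - 1])) { d[j] = d[j - 1]; j--; }
        d[j] = tmp;
    }
}

/* visit every permutation of p[0..5] and collect comparison statistics */
static void try_all(int *p, int k, long *mn, long *mx, long *sum, long *n) {
    if (k == 6) {
        int d[6];
        for (int i = 0; i < 6; i++) d[i] = p[i];
        cmp_count = 0;
        sort6_counted(d);
        if (cmp_count < *mn) *mn = cmp_count;
        if (cmp_count > *mx) *mx = cmp_count;
        *sum += cmp_count;
        (*n)++;
        return;
    }
    for (int i = k; i < 6; i++) {
        int t = p[k]; p[k] = p[i]; p[i] = t;
        try_all(p, k + 1, mn, mx, sum, n);
        t = p[k]; p[k] = p[i]; p[i] = t;
    }
}

int main(void) {
    int p[6] = {0, 1, 2, 3, 4, 5};
    long mn = 999, mx = 0, sum = 0, n = 0;
    try_all(p, 0, &mn, &mx, &sum, &n);
    printf("%ld permutations: min=%ld max=%ld avg=%.2f compares\n",
           n, mn, mx, (double)sum / (double)n);
    return 0;
}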

I wouldn't worry too much about emptying pipelines (assuming current x86): branch prediction has come a long way. What I would worry about is making sure that the code and data fit in one cache line each (maybe two for the code). Once there, fetch latencies are refreshingly low, which will compensate for any stall. It also means that your inner loop will be maybe ten instructions or so, which is right where it should be (there are two different inner loops in my sorting algorithm; they are 10 instructions/22 bytes and 9 instructions/22 bytes long respectively). Assuming the code doesn't contain any divs, you can be sure it will be blindingly fast.

Olof Forshell
  • I'm not sure how to understand your answer. First I don't understand at all what algorithm you are proposing? And how could it be optimal if you have to loop through 720 possible orderings (existing answers take much less than 720 cycles)? If you have random input I can't imagine (even on a theoretical level) how branch prediction could perform better than 50-50, except if it doesn't care at all about the input data. Also most of the good solutions already proposed are likely to work with both data and code fully in cache. But maybe I completely misunderstood your answer. Mind showing some code? – kriss Mar 06 '12 at 23:33
  • What I meant was that there are only 720 (6!) different combinations of 6 integers and by running all of them through the candidate algorithms you can determine a lot of things as I mentioned - that's the theoretical part. The practical part is fine-tuning that algorithm to run in as few clock cycles as possible. My starting point for sorting 6 integers is a 1, 4 gap shellsort. The 4 gap paves the way for good branch prediction in the 1 gap. – Olof Forshell Mar 07 '12 at 08:05
  • The 1, 4 gap shellsort for 6! unique combinations (beginning with 012345 and ending with 543210) will have a best case of 7 comparisons and 0 exchanges and a worst of 14 comparisons and 10 exchanges. The average case is about 11.14 comparisons and 6 exchanges. – Olof Forshell Mar 07 '12 at 14:55
  • Mistake in the average sort: it is 10.63 comparisons and 6 exchanges. – Olof Forshell Mar 08 '12 at 09:40
  • Are you not forgetting to count comparisons used for ending the loop ? – kriss Mar 08 '12 at 22:31
  • The number of comparisons are those required for comparing the actual (integer in this case) data. If the data access is more complex (such as for a string at an offset in a row in a table) any other conditional test (such as the one you mention) will be trivial in comparison. In this case the tests are equivalent processing-wise, but for determining generic sort-algorithm efficiency you count the comparisons (and exchanges) of the data to be sorted. – Olof Forshell Mar 09 '12 at 06:09
  • I understand what you say, but you are answering another question that I haven't asked. This one is about sorting an array of 6 ints using the usual int comparison operator. I pointed your count out because it means you are treating as "no cost" operations that are not free in this case. You also assume a regular random distribution of data to compute the count, which may or may not be the case. I may be wrong but my rough guess is that gap sort will perform roughly like insertion sort. Some solutions proposed here are 3 times faster. – kriss Mar 14 '12 at 08:24
    I don't get the "regular random distribution" - what I am doing is testing every possible combination and determining min/average/max stats. Shellsort is a series of insertion sorts of decreasing increments such that the final increment - 1 - does a lot less work than if it is performed all alone as in a pure insertion sort. As to clock count my algorithm requires 406 clock cycles on average and this includes collecting statistics and doing two calls to the actual sorting routine - one for each gap. This is on an Athlon M300 mobile, compiler OpenWatcom. – Olof Forshell Mar 14 '12 at 09:11
    "regular random distribution" means every combination of actual data that is sorted may not be of equal probability. If all combinations are not of equal probability your stats are broken, because the average needs to take into account how many times a given distribution is likely to occur. For the clock count, if you try out any other implementation of this sort (links provided above) and run it on your test system, we will have a basis for comparison and see how well your chosen one performs. – kriss Mar 19 '12 at 00:43

Try a 'merging sorted lists' sort. :) Use two arrays. It is fastest for both small and big arrays.
When concatenating, you only check where to insert; the other, bigger values don't need to be compared (cmp = a-b>0).
For 4 numbers, you can use a scheme with 4-5 cmp (~4.6) or 3-6 cmp (~4.9). Bubble sort uses 6 cmp (6). Lots of cmp operations make the code slower for big numbers.
This code uses 5 cmp (it is not the MSL sort):

if (cmp(arr[n][i+0],arr[n][i+1])>0) {swap(n,i+0,i+1);}
if (cmp(arr[n][i+2],arr[n][i+3])>0) {swap(n,i+2,i+3);}
if (cmp(arr[n][i+0],arr[n][i+2])>0) {swap(n,i+0,i+2);}
if (cmp(arr[n][i+1],arr[n][i+3])>0) {swap(n,i+1,i+3);}
if (cmp(arr[n][i+1],arr[n][i+2])>0) {swap(n,i+1,i+2);}

The MSL principle

9 8 7 6 5 4 3 2 1 0
89 67 45 23 01 ... concat two sorted lists, list length = 1
6789 2345 01 ... concat two sorted lists, list length = 2
23456789 01 ... concat two sorted lists, list length = 4
0123456789 ... concat two sorted lists, list length = 8

js code

function sortListMerge_2a(cmp)  
{
var step, stepmax, tmp, a,b,c, i,j,k, m,n, cycles;
var start = 0;
var end   = arr_count;
//var str = '';
cycles = 0;
if (end>3)
    {
    stepmax = ((end - start + 1) >> 1) << 1;
    m = 1;
    n = 2;
    for (step=1;step<stepmax;step<<=1)  //bounds 1-1, 2-2, 4-4, 8-8...
        {
        a = start;
        while (a<end)
            {
            b = a + step;
            c = a + step + step;
            b = b<end ? b : end;
            c = c<end ? c : end;
            i = a;
            j = b;
            k = i;
            while (i<b && j<c)
                {
                if (cmp(arr[m][i],arr[m][j])>0)
                    {arr[n][k] = arr[m][j]; j++; k++;}
                else    {arr[n][k] = arr[m][i]; i++; k++;}
                }
            while (i<b)
                {arr[n][k] = arr[m][i]; i++; k++;}
            while (j<c)
                {arr[n][k] = arr[m][j]; j++; k++;}
            a = c;
            }
        tmp = m; m = n; n = tmp;
        }
    return m;
    }
else
    {
    // sort 3 items
    sort10(cmp);
    return m;
    }
}
peter

Maybe I am late to the party, but at least my contribution is a new approach.

  • The code really should be inlined
  • even if inlined, there are too many branches
  • the analysing part is basically O(N(N-1)) which seems OK for N=6
  • the code would be more effective if the cost of a swap were higher (relative to the cost of a compare)
  • I rely on static functions being inlined.
  • The method is related to rank-sort
    • instead of ranks, the relative ranks (offsets) are used.
    • the sum of the ranks is zero for every cycle in any permutation group.
    • instead of SWAP()ing two elements, the cycles are chased, needing only one temp, and one (register->register) swap (new <- old).

Update: changed the code a bit, some people use C++ compilers to compile C code ...

#include <stdio.h>

#if WANT_CHAR
typedef signed char Dif;
#else
typedef signed int Dif;
#endif

static int walksort (int *arr, int cnt);
static void countdifs (int *arr, Dif *dif, int cnt);
static void calcranks(int *arr, Dif *dif);

int wsort6(int *arr);

void do_print_a(char *msg, int *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", *arr);
        }
fprintf(stderr,"\n");
}

void do_print_d(char *msg, Dif *arr, unsigned cnt)
{
fprintf(stderr,"%s:", msg);
for (; cnt--; arr++) {
        fprintf(stderr, " %3d", (int) *arr);
        }
fprintf(stderr,"\n");
}

static void inline countdifs (int *arr, Dif *dif, int cnt)
{
int top, bot;

for (top = 0; top < cnt; top++ ) {
        for (bot = 0; bot < top; bot++ ) {
                if (arr[top] < arr[bot]) { dif[top]--; dif[bot]++; }
                }
        }
return ;
}
        /* Copied from RexKerr ... */
static void inline calcranks(int *arr, Dif *dif){

dif[0] =     (arr[0]>arr[1])+(arr[0]>arr[2])+(arr[0]>arr[3])+(arr[0]>arr[4])+(arr[0]>arr[5]);
dif[1] = -1+ (arr[1]>=arr[0])+(arr[1]>arr[2])+(arr[1]>arr[3])+(arr[1]>arr[4])+(arr[1]>arr[5]);
dif[2] = -2+ (arr[2]>=arr[0])+(arr[2]>=arr[1])+(arr[2]>arr[3])+(arr[2]>arr[4])+(arr[2]>arr[5]);
dif[3] = -3+ (arr[3]>=arr[0])+(arr[3]>=arr[1])+(arr[3]>=arr[2])+(arr[3]>arr[4])+(arr[3]>arr[5]);
dif[4] = -4+ (arr[4]>=arr[0])+(arr[4]>=arr[1])+(arr[4]>=arr[2])+(arr[4]>=arr[3])+(arr[4]>arr[5]);
dif[5] = -(dif[0]+dif[1]+dif[2]+dif[3]+dif[4]);
}

static int walksort (int *arr, int cnt)
{
int idx, src,dst, nswap;

Dif difs[cnt];

#if WANT_REXK
calcranks(arr, difs);
#else
for (idx=0; idx < cnt; idx++) difs[idx] =0;
countdifs(arr, difs, cnt);
#endif
calcranks(arr, difs);

#define DUMP_IT 0
#if DUMP_IT
do_print_d("ISteps ", difs, cnt);
#endif

nswap = 0;
for (idx=0; idx < cnt; idx++) {
        int newval;
        int step,cyc;
        if ( !difs[idx] ) continue;
        newval = arr[idx];
        cyc = 0;
        src = idx;
        do      {
                int oldval;
                step = difs[src];
                difs[src] =0;
                dst = src + step;
                cyc += step ;
                if(dst == idx+1)idx=dst;
                oldval = arr[dst];
#if (DUMP_IT&1)
                fprintf(stderr, "[Nswap=%d] Cyc=%d Step=%2d Idx=%d  Old=%2d New=%2d #### Src=%d Dst=%d[%2d]->%2d <-- %d\n##\n"
                        , nswap, cyc, step, idx, oldval, newval
                        , src, dst, difs[dst], arr[dst]
                        , newval  );
                do_print_a("Array ", arr, cnt);
                do_print_d("Steps ", difs, cnt);
#endif

                arr[dst] = newval;
                newval = oldval;
                nswap++;
                src = dst;
                } while( cyc);
        }

return nswap;
}
/*************/
int wsort6(int *arr)
{
return walksort(arr, 6);
}
wildplasser
  • looks like a bubble sort. Potentially a good contender for the slowest implementation, but it can still be of some interest to know if working on the code makes that much difference. Please put your code in the same format as the others, so we can run the benchmark on it. – kriss Feb 01 '18 at 10:53
  • @kriss https://en.wikipedia.org/wiki/Permutation_group It is certainly *not* bubble sort: the code detects cycles in the given permutation, and walks these cycles, putting each element at its final place. The final `wsort6()` function has the correct interface. – joop Feb 01 '18 at 11:18
  • @joop: my bad, no bubble sort indeed. That being said, in this context I'm still expecting the code to be much worse than any other current implementation. By the way the Rank Order solution is optimal regarding the number of swaps, as it directly finds the final position of every item. It's also unclear if walksort even works when we remove the hypothesis that all sorted numbers are different, as here. To benchmark the code we should remove the trace code. Also as I'm usually compiling with a C++ compiler, the code won't work because the OP called a variable "new" (and that breaks syntax highlighting). – kriss Feb 01 '18 at 17:04
  • The method is very close to rank order, only the final assignments are done *in place* . Apart from the ranks `o1..o5`, there is no need for the second temp `e[6]` array. And: compiling C code on a C++ compiler, and blaming the code ? – joop Feb 01 '18 at 17:45
  • @greybeard: thank you,I added a space before `#include` . Fixed – wildplasser Feb 01 '18 at 23:56
    Your code indentation sure is something else (try, for example, get [indent(1)](http://man.openbsd.org/indent.1) to produce it): where did you get it from? – greybeard Feb 02 '18 at 06:21
//Bruteforce compute unrolled count dumbsort(min to 0-index)
void bcudc_sort6(int* a)
{
    int t[6] = {0};
    int r1,r2;

    r1=0;
    r1 += (a[0] > a[1]);
    r1 += (a[0] > a[2]);
    r1 += (a[0] > a[3]);
    r1 += (a[0] > a[4]);
    r1 += (a[0] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[0];

    r2=0;
    r2 += (a[1] > a[0]);
    r2 += (a[1] > a[2]);
    r2 += (a[1] > a[3]);
    r2 += (a[1] > a[4]);
    r2 += (a[1] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[1];

    r1=0;
    r1 += (a[2] > a[0]);
    r1 += (a[2] > a[1]);
    r1 += (a[2] > a[3]);
    r1 += (a[2] > a[4]);
    r1 += (a[2] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[2];

    r2=0;
    r2 += (a[3] > a[0]);
    r2 += (a[3] > a[1]);
    r2 += (a[3] > a[2]);
    r2 += (a[3] > a[4]);
    r2 += (a[3] > a[5]);
    while(t[r2]){r2++;} 
    t[r2] = a[3];

    r1=0;
    r1 += (a[4] > a[0]);
    r1 += (a[4] > a[1]);
    r1 += (a[4] > a[2]);
    r1 += (a[4] > a[3]);
    r1 += (a[4] > a[5]);
    while(t[r1]){r1++;}
    t[r1] = a[4];

    r2=0;
    r2 += (a[5] > a[0]);
    r2 += (a[5] > a[1]);
    r2 += (a[5] > a[2]);
    r2 += (a[5] > a[3]);
    r2 += (a[5] > a[4]);
    while(t[r2]){r2++;} 
    t[r2] = a[5];

    a[0]=t[0];
    a[1]=t[1];
    a[2]=t[2];
    a[3]=t[3];
    a[4]=t[4];
    a[5]=t[5];
}

static __inline__ void sort6(int* a)
{
    #define wire(x,y); t = a[x] ^ a[y] ^ ( (a[x] ^ a[y]) & -(a[x] < a[y]) ); a[x] = a[x] ^ t; a[y] = a[y] ^ t;
    register int t;

    wire( 0, 1); wire( 2, 3); wire( 4, 5);
    wire( 3, 5); wire( 0, 2); wire( 1, 4);
    wire( 4, 5); wire( 2, 3); wire( 0, 1); 
    wire( 3, 4); wire( 1, 2); 
    wire( 2, 3);

    #undef wire
}
FranG
  • Regardless of speed are you sure it works ? In bruteforce sort your loops are dubious. Seems to me they won't work if we have a zero in sorted values. – kriss Feb 20 '19 at 16:31
    t[6] array is initialized to 0x0. So it doesn't matter where and if a 0x0 valued key will be written. – FranG Feb 20 '19 at 20:05

Sort 4 items, making use of cmp==0 results. The number of cmp operations is ~4.34 (FF native has ~4.52), but it takes 3x the time of merging lists. Fewer cmp operations are better, though, if you have big numbers or big text. Edit: repaired bug

Online test http://mlich.zam.slu.cz/js-sort/x-sort-x2.htm

function sort4DG(cmp,start,end,n) // sort 4
{
var n     = typeof(n)    !=='undefined' ? n   : 1;
var cmp   = typeof(cmp)  !=='undefined' ? cmp   : sortCompare2;
var start = typeof(start)!=='undefined' ? start : 0;
var end   = typeof(end)  !=='undefined' ? end   : arr[n].length;
var count = end - start;
var pos = -1;
var i = start;
var cc = [];
// stabilni?
cc[01] = cmp(arr[n][i+0],arr[n][i+1]);
cc[23] = cmp(arr[n][i+2],arr[n][i+3]);
if (cc[01]>0) {swap(n,i+0,i+1);}
if (cc[23]>0) {swap(n,i+2,i+3);}
cc[12] = cmp(arr[n][i+1],arr[n][i+2]);
if (!(cc[12]>0)) {return n;}
cc[02] = cc[01]==0 ? cc[12] : cmp(arr[n][i+0],arr[n][i+2]);
if (cc[02]>0)
    {
    swap(n,i+1,i+2); swap(n,i+0,i+1); // bubble last to top
    cc[13] = cc[23]==0 ? cc[12] : cmp(arr[n][i+1],arr[n][i+3]);
    if (cc[13]>0)
        {
        swap(n,i+2,i+3); swap(n,i+1,i+2); // bubble
        return n;
        }
    else    {
    cc[23] = cc[23]==0 ? cc[12] : (cc[01]==0 ? cc[30] : cmp(arr[n][i+2],arr[n][i+3]));  // new cc23 | c03 //repaired
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    }
else    {
    if (cc[12]>0)
        {
        swap(n,i+1,i+2);
        cc[23] = cc[23]==0 ? cc[12] : cmp(arr[n][i+2],arr[n][i+3]); // new cc23
        if (cc[23]>0)
            {
            swap(n,i+2,i+3);
            return n;
            }
        return n;
        }
    else    {
        return n;
        }
    }
return n;
}
peter
    The use case is slightly different from the initial context of the question. With fixed-length sorts details matter and counting cmp or swaps is not enough. I wouldn't even be surprised if it weren't the actual sort at all that consumes the time, but something completely different like calling typeof() in the init. I don't know how to perform actual clock time measurement using Javascript. Maybe with node? – kriss Jan 13 '17 at 13:50

Well, if it's only 6 elements and you can leverage parallelism, want to minimize conditional branching, etc., why don't you generate all the combinations and test for order? I would venture that on some architectures it can be pretty fast (as long as you have the memory preallocated).

GClaramunt
    There are 720 orderings, and the fast versions are well under 100 cycles. Even if massive parallelism could be leveraged, at such a small time scale the cost of creating and synchronizing the threads would likely exceed the cost of just sorting the arrays on one core. – Kevin Stock Aug 24 '11 at 15:19