Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

Question

I am sorting 10+ million uint64_ts with RGB data from .RAW files and 79% of my C program time is spent in qsort. I am looking for a faster sort for this specific data type.

Being RAW graphical data, the numbers are very random and ~80% unique. No partial sorting or runs of sorted data can be expected. The 4 uint16_ts inside the uint64_t are R, G, B and zero (possibly a small count <= ~20).

I have the simplest comparison function I can think of using unsigned long longs (you CANNOT just subtract them):

qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); 
...
int comp_uint64(const void *a, const void *b)  {
    if(*((uint64_t *)a) > *((uint64_t *)b))  return(+1);
    if(*((uint64_t *)a) < *((uint64_t *)b))  return(-1);
    return(0);
}  // End Comp_uint64().

There was a very interesting "Programming Puzzles & Code Golf" on StackExchange, but they used floats. Then there are QSort, RecQuick, heap, stooge, tree, radix...

The swenson/sort looked interesting but had no (obvious) support for my datatype, uint64_t. And the "quick sort" time was the best. Some sources say the system qsort can be anything, not necessarily "Quick Sort".

A C++ sort bypasses the generic casting of void pointers and realizes great improvements in performance over C. There has to be an optimized method to slam U8s through a 64bit processor at warp speed.

System/compiler info:

I am currently using the GCC with Strawberry Perl

gcc version 4.9.2 (x86_64-posix-sjlj, built by strawberryperl.com
Intel 2700K Sandy Bridge CPU, 32GB DDR3
windows 7/64 pro

gcc -D__USE_MINGW_ANSI_STDIO -O4 -ffast-math -m64 -Ofast -march=corei7-avx -mtune=corei7 -Ic:/bin/xxHash-master -Lc:/bin/xxHash-master c:/bin/stddev.c -o c:/bin/stddev.g6.exe

First attempt at a better `qsort`, `QSORT()`!

Tried to use Michael Tokarev's inline qsort.

"READY-TO-USE"? From qsort.h documentation

-----------------------------
* Several ready-to-use examples:
 *
 * Sorting array of integers:
 * void int_qsort(int *arr, unsigned n) {
 * #define int_lt(a,b) ((*a)<(*b))
 *   QSORT(int, arr, n, int_lt);
--------------------------------

Change from type "int" to "uint64_t"
compile error on TYPE???
    
    c:/bin/bpbfct.c:586:8: error: expected expression before 'uint64_t'
      QSORT(uint64_t, hpidx, num_pix, islt);

I can't find a real, compiling, working example program, just comments with the "general concept"

#define QSORT_TYPE uint64_t 
#define islt(a,b) ((*a)<(*b))

uint64_t *QSORT_BASE; 
int QSORT_NELT;

hpidx=(uint64_t *) calloc(num_pix+2, sizeof(uint64_t));  // Hash . PIDX
QSORT_BASE = hpidx;
QSORT_NELT = num_pix;  // QSORT_LT is function QSORT_LT()
QSORT(uint64_t, hpidx, num_pix, islt);  
//QSORT(uint64_t *, hpidx, num_pix, QSORT_LT);  // QSORT_LT mal-defined?
//qsort(hpidx, num_pix, sizeof(uint64_t), comp_uint64); // << WORKS

The "ready-to-use" examples use types of int, char * and struct elt. Isn't uint64_t a type?? Try long long

QSORT(long long, hpidx, num_pix, islt); 
c:/bin/bpbfct.c:586:8: error: expected expression before 'long'
 QSORT(long long, hpidx, num_pix, islt);

Next attempt: `RADIXSORT`:

Results: RADIX_SORT is RADICAL!

  I:\br3\pf.249465>grep "Event" bb12.log | grep -i Sort       
 << 1.40 sec average
4) Time=1.411 sec    = 49.61%, Event RADIX_SORT        , hits=1
4) Time=1.396 sec    = 49.13%, Event RADIX_SORT        , hits=1
4) Time=1.392 sec    = 49.15%, Event RADIX_SORT        , hits=1
16) Time=1.414 sec    = 49.12%, Event RADIX_SORT        , hits=1

I:\br3\pf.249465>grep "Event" bb11.log | grep -i Sort 
 << 5.525 sec average  = 3.95 time slower
4) Time=5.538 sec    = 86.34%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.41%, Event QSort             , hits=1
4) Time=5.519 sec    = 79.02%, Event QSort             , hits=1
4) Time=5.563 sec    = 79.49%, Event QSort             , hits=1
4) Time=5.684 sec    = 79.83%, Event QSort             , hits=1
4) Time=5.509 sec    = 79.30%, Event QSort             , hits=1

3.94 times faster than whatever sort qsort out of the box uses!

And, even more importantly, there was actual, working code, not just 80% of what you need given by some Guru who assumes you know everything they know and can fill in the other 20%.

Fantastic solution! Thanks Louis Ricci!

if your data is as random as you described, then i would say qsort would be one of the implementations with the stablest performance already. — Jason Hu, Aug 24 '15 at 17:52
Can you simply use the C++ sort? You can put it into a separate .cpp file with an `extern "C"` so the remainder of your code can stay in C. — Adam, Aug 24 '15 at 17:54
@user3386109 You're thinking of RGBA. I think the Z refers to depth information. Either way, it's irrelevant to the question. — Adam, Aug 24 '15 at 17:55
[counting sort](https://en.wikipedia.org/wiki/Counting_sort)? maybe — Nikos M., Aug 24 '15 at 18:00
RGBZ is for RGB_ZERO_ I smash 3 UINT16s into a UINT64 because a UINT32 is too small and neither K nor R had digital cameras with 16 bit quanta, there is no UINT48. What an egregious oversight! I clear all 8 bytes before loading with RGB. I also add counts and other stuff to the last 2 bytes. — BrianP007, Aug 24 '15 at 19:11
Qsort performance is dominated by the function calling overhead. Create your own sort, and use an inlinable compare function. — wildplasser, Aug 24 '15 at 21:15
The A in RGB_A_ is usually for ALPHA channel which is either OPACITY or TRANSPARENCY: https://en.wikipedia.org/wiki/RGBA_color_space — BrianP007, Aug 24 '15 at 21:57
@BrianP007 - Since 16bits of the 64bits are zero you can comment out the 2 for loops under the *//radix* comment that deal with those bits and get another ~20% speedup. — Louis Ricci, Aug 25 '15 at 11:43
the compare function that you posted sorts first by the intensity of the color red then by the color blue then by the color green then by the transparency. What do you want the sort criteria to be? — user3629249, Aug 25 '15 at 13:38

score 8 · Accepted Answer · edited May 23 '17 at 12:34

I would use Radix Sort with an 8bit radix. For 64bit values a well optimized radix sort will have to iterate over the list 9 times (one to precalculate the counts and offsets and 8 for 64bits/8bits). 9*N time and 2*N space (using a shadow array).

Here's what an optimized radix sort would look like.

typedef union {
    struct {
        uint32_t c8[256];
        uint32_t c7[256];
        uint32_t c6[256];
        uint32_t c5[256];
        uint32_t c4[256];
        uint32_t c3[256];
        uint32_t c2[256];
        uint32_t c1[256];
    };
    uint32_t counts[256 * 8];
} rscounts_t;

uint64_t * radixSort(uint64_t * array, uint32_t size) {
    rscounts_t counts;
    memset(&counts, 0, 256 * 8 * sizeof(uint32_t));
    uint64_t * cpy = (uint64_t *)malloc(size * sizeof(uint64_t));
    uint32_t o8=0, o7=0, o6=0, o5=0, o4=0, o3=0, o2=0, o1=0;
    uint32_t t8, t7, t6, t5, t4, t3, t2, t1;
    uint32_t x;
    // calculate counts
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        t7 = (array[x] >> 8) & 0xff;
        t6 = (array[x] >> 16) & 0xff;
        t5 = (array[x] >> 24) & 0xff;
        t4 = (array[x] >> 32) & 0xff;
        t3 = (array[x] >> 40) & 0xff;
        t2 = (array[x] >> 48) & 0xff;
        t1 = (array[x] >> 56) & 0xff;
        counts.c8[t8]++;
        counts.c7[t7]++;
        counts.c6[t6]++;
        counts.c5[t5]++;
        counts.c4[t4]++;
        counts.c3[t3]++;
        counts.c2[t2]++;
        counts.c1[t1]++;
    }
    // convert counts to offsets
    for(x = 0; x < 256; x++) {
        t8 = o8 + counts.c8[x];
        t7 = o7 + counts.c7[x];
        t6 = o6 + counts.c6[x];
        t5 = o5 + counts.c5[x];
        t4 = o4 + counts.c4[x];
        t3 = o3 + counts.c3[x];
        t2 = o2 + counts.c2[x];
        t1 = o1 + counts.c1[x];
        counts.c8[x] = o8;
        counts.c7[x] = o7;
        counts.c6[x] = o6;
        counts.c5[x] = o5;
        counts.c4[x] = o4;
        counts.c3[x] = o3;
        counts.c2[x] = o2;
        counts.c1[x] = o1;
        o8 = t8; 
        o7 = t7; 
        o6 = t6; 
        o5 = t5; 
        o4 = t4; 
        o3 = t3; 
        o2 = t2; 
        o1 = t1;
    }
    // radix
    for(x = 0; x < size; x++) {
        t8 = array[x] & 0xff;
        cpy[counts.c8[t8]] = array[x];
        counts.c8[t8]++;
    }
    for(x = 0; x < size; x++) {
        t7 = (cpy[x] >> 8) & 0xff;
        array[counts.c7[t7]] = cpy[x];
        counts.c7[t7]++;
    }
    for(x = 0; x < size; x++) {
        t6 = (array[x] >> 16) & 0xff;
        cpy[counts.c6[t6]] = array[x];
        counts.c6[t6]++;
    }
    for(x = 0; x < size; x++) {
        t5 = (cpy[x] >> 24) & 0xff;
        array[counts.c5[t5]] = cpy[x];
        counts.c5[t5]++;
    }
    for(x = 0; x < size; x++) {
        t4 = (array[x] >> 32) & 0xff;
        cpy[counts.c4[t4]] = array[x];
        counts.c4[t4]++;
    }
    for(x = 0; x < size; x++) {
        t3 = (cpy[x] >> 40) & 0xff;
        array[counts.c3[t3]] = cpy[x];
        counts.c3[t3]++;
    }
    for(x = 0; x < size; x++) {
        t2 = (array[x] >> 48) & 0xff;
        cpy[counts.c2[t2]] = array[x];
        counts.c2[t2]++;
    }
    for(x = 0; x < size; x++) {
        t1 = (cpy[x] >> 56) & 0xff;
        array[counts.c1[t1]] = cpy[x];
        counts.c1[t1]++;
    }
    free(cpy);
    return array;
}

EDIT this implementation was based on a JavaScript version Fastest way to sort 32bit signed integer arrays in JavaScript?

Here's the IDEONE for the C radix sort http://ideone.com/JHI0d9

Radix sort doesn't have too many uses, but it's nearly perfect for sorting pixels. Unfortunately the one time I had to sort pixels I didn't realize it until I had implemented something else. — Mark Ransom, Aug 24 '15 at 21:02
with 'size' being in the millions, that 'malloc' will be huge. — user3629249, Aug 25 '15 at 13:41
maybe a simpler counting sort (of similar `O(n)` complexity) would do as well — Nikos M., Aug 25 '15 at 15:20
@user3629249 - 10million uint64's * 8 bytes = 80million bytes ~= 80MB, I think a modern computer can handle an 80MB allocation. There are inplace versions of radix sort to avoid the the shadow array allocation but this version seems fast enough. — Louis Ricci, Aug 25 '15 at 16:54
@NikosM. - I think you'd find the counting sort would turn into a radix sort. The elements being sorted are uint64 values so for counting sort you'd need a count array of size 2^64 (for this specific problem 16bits of those 64 are zero so you'd only need 2^48 which is still huge). — Louis Ricci, Aug 25 '15 at 17:01
Ive had a look at this implementation, and Im not certain I understand why array is being returned, since it is a input pointer in the first place? Ie, the array will be written to. — David Lannan, Aug 31 '20 at 06:53
@Anush -Yes, using larger radix like 16bit instead of 8 tend to be slower. The larger "counts" arrays 65536 vs 256 are prone to more cache misses when loading data from them during each pass. — Louis Ricci, Jun 10 '21 at 21:10
That's really interesting and not very obvious given how small 65536 is. — Simd, Jun 11 '21 at 08:07

Adam · Answer 2 · 2015-08-24T18:19:31.373

4

I see a few options, roughly in order of easiest to hardest.

Enable link-time optimization with the -flto switch. This may get the compiler to inline your comparison function. It's too easy not to try.
If LTO has no effect, you can use an inline qsort implementation like Michael Tokarev's inline qsort. This page suggests a 2x improvement, again solely due to the compiler's ability to inline the comparison function.
Use the C++ std::sort. I know your code is in C, but you can make a small module that only sorts and provides a C interface. You're already using a toolchain that has great C++ support.
Try swenson/sort's library. It implements many algorithms so you can use the one that works best on your data. It appears to be inlineable, and they claim to be faster than qsort.
Find another sorting library. Something that can do Louis' Radix Sort is a good suggestion.

Note you can also do your comparison with a single branch instead of two. Just find out which is bigger, then subtract.

edited Aug 24 '15 at 18:19

answered Aug 24 '15 at 18:14

Adam

16,808
7
52
98

Subtracting two `uint64_t` might generate a value that doesn't fit into an `int`. If not for that you could eliminate the branches altogether and just return the result of the subtraction. – Mark Ransom Aug 24 '15 at 20:59
>> Subtracting two uint64_t might generate a value that doesn't fit into an int. I found out that half of the subtractions generated negative results which screwed up the entire sort. The dual step process was designed to sidestep this problem. I have switched to RADIX_SORT anyway; 4X faster! – BrianP007 Aug 24 '15 at 21:52
@BrianP007 the other issue with subtracting unsigned ints is that they cannot be negative. – Adam Aug 24 '15 at 22:53
@BrianP007 if you decided to go with Louis' answer, you should mark it as 'accepted' by clicking the checkmark next to his answer. – Adam Aug 24 '15 at 22:54

score 1 · Answer 3 · answered Aug 24 '15 at 21:07

With some compilers/platforms the following is branch-less and faster, though not much different than OP's original.

int comp_uint64_b(const void *a, const void *b)  {
    return 
     (*((uint64_t *)a) > *((uint64_t *)b)) - 
     (*((uint64_t *)a) < *((uint64_t *)b));
}

score -3 · Answer 4 · answered Aug 24 '15 at 17:52

-3

Maybe some ?: instead of ifs would make things a tad quicker.

answered Aug 24 '15 at 17:52

Lambder

1

Micro-optimizations like this are usually bad - let the compiler do its thing. – tonysdg Aug 24 '15 at 17:57
1

That's highly unlikely. Any competently constructed compiler will generate the same code for a traditional `if ... else` and the corresponding ternary (i.e. `?:`) expression. – Jim Mischel Aug 24 '15 at 17:57

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

First attempt at a better `qsort`, `QSORT()`!

Next attempt: `RADIXSORT`:

4 Answers4

Linked

Fastest sort algorithm for millions of UINT64 RGBZ graphics pixels

First attempt at a better qsort, QSORT()!

Next attempt: RADIXSORT:

4 Answers4

Linked

First attempt at a better `qsort`, `QSORT()`!

Next attempt: `RADIXSORT`: