Is there a way to code a qsort comparison function for unsigned integers without using lots of branching statements?

Question

I wrote a (qsort-compatible) comparison function for a struct that has some unsigned fields in it:

typedef struct {
   int      a;
   unsigned b;
} T;

int cmp(T t1, T t2)
{
   // Decreasing order in "a"
   if (t1.a < t2.a) return +1;
   if (t1.a > t2.a) return -1;
   // Increasing order in "b"
   if (t1.b < t2.b) return -1;
   if (t1.b > t2.b) return +1;
   return 0;
}

Is there a way to write this function without needing two comparisons per field? I can't use the t1.b - t2.b trick because subtraction for unsigned integer wraps around.

I'm willing to accept answers using GCC extensions.

@user3386109: The answer to the question as stated in the title is simply "Yes". You can put your entire `cmp` function definition on one line. Of course you *shouldn't*, but I don't think putting it on one line is really your goal. I suggest updating your title to reflect what you're actually asking. — Keith Thompson, Jun 09 '15 at 15:41
Arethere any limits known on `a` and `b` values, or are they using whole possible range of their types? — Suma, Jun 09 '15 at 16:38
@Suma: No limits, in principle. But there is an alternative solution that depends on having a more restricted range then I'm also curious how it would look like. — hugomg, Jun 09 '15 at 16:40
I guess even though it's unsigned you should be able do draw some information out from the relation between (a-b) and (b-a). I guess there is some weird arithmetic trick to get the result arithmetically encoded, but it's going to be obfuscated and I'm uncertain if it's going to be faster. — Alexander Oh, Jun 09 '15 at 21:20
I was completely wrong, no need to perform any optimization here: even if they're branches compiler is smart enough to remove them. Even in your original code there isn't any `Jcc`, in my case it generated `CMOV` and full **function is branchless**. — Adriano Repetti, Jun 10 '15 at 13:42
Indeed the early-return function is branchless. The real performance loss is in the function calling overhead and instruction-pipeline drain. — joop, Jun 29 '15 at 11:44
Funny that nobody noticed that the compare function does not have the right signature for `qsort()`. It should be `int cmp(T *t1, T *t2);` 2 pointers, not two structures copied by value. — Patrick Schlüter, Jul 02 '18 at 09:31

score 2 · Answer 1 · edited May 23 '17 at 10:28

2

Use wider math.

Given int and unsigned fields, a given platform very likely supports a wider integer type such as long long that accommodates putting these 2 together.

int cmp(T t1, T t2)
{
   // An optimized compilation will not do any multiplication nor addition,
   // Just simply load `n1` high/low halves with `t1.a`, `t1.b`.
   long long n1 = t1.a * (UINT_MAX + 1LL) + t1.b;
   long long n2 = t2.a * (UINT_MAX + 1LL) + t2.b;
   return (n1 > n2) - (n1 < n2);  
}

If this approach is faster - profiling will answer that for select platforms.

Although this uses fewer compares, the compares use wider math - possible a zero sum gain.

When a 2x integer width is available as in How to determine integer types that are twice the width as `int` and `unsigned`?. This works. For high portability, stick with OP's original approach.

The (var1 > var2) - (var1 < var2) is considered by some to be branch-less. Of course OP's original code could end with:

return (t1.b > t2.b) - (t1.b < t2.b);

edited May 23 '17 at 10:28

Community

1
1

answered Jun 09 '15 at 20:59

chux - Reinstate Monica

143,097
13
135
256

1

As I said for Keith's answer: even if it's written without any `if` compiler still generates `jg`, `jl` and (possibly) `je` instructions to evaluate `t2->a > t1->a`. **They're branches** (even if less obvious so I hardly consider them branchless unless a very specific architecture has specialized instructions to do it without branches - yes `cmov` may help but I didn't see any compiler to generate it) but to evaluate full expression there are 2/3 to 4/5 branches. – Adriano Repetti Jun 10 '15 at 06:44
You might want to explain in the answer even on x86 platform the number of branches will still be reduced, in spite of arithmetics (add, compare) being implemented using two instructions, because they use carry flags to handle overflows between them not branching. Perhaps adding a disassembly of some compiler output would help? – Suma Jun 10 '15 at 07:30
@Adriano Repetti Agree that `t2->a > t1->a` may evaluate to code that branches. Agree that various processors have instructions that would generate branch-less code. With processors that are sensitive to branching (pipelined) are more likely to have than branch-less compare instrucitons than simple ones (e.g. embedded). – chux - Reinstate Monica Jun 10 '15 at 13:56
1

@chux yes, I wrongly assumed (without checking) that compiler won't optimize this particular case. We may reasonably assume _almost_ every half-modern CPU has something equivalent to SETcc (if CMOV isn't available) but yes...the others still have (if needed) to rewrite this without branches (IMO with subtraction and some tricks). – Adriano Repetti Jun 10 '15 at 14:36
@Adriano Repetti Note on modern CPUs: Embedded processors - often with reduced instruction sets - are made in the billions per year in 2015. C is popular in such devices and not having a SETcc equivalent is not uncommon. In the end - having branch-less or not is not the real issue - but a stop toward the real issues: speed, power, cost. – chux - Reinstate Monica Jun 10 '15 at 15:18
I absolutely agree with that, I mean I wouldn't bother (see my deleted answer) to avoid few branches before checking my compiler (for CPU I'm targeting) won't do better than me... – Adriano Repetti Jun 10 '15 at 15:20

Keith Thompson · Answer 2 · 2015-06-10T14:44:00.553

Any relational comparison between two values can only yield one of two results. You need three distinct results for a qsort comparison function, so a single comparison can't do the job. (Perl has a <=> operator that does exactly what you want, but it's not available in C.)

You'll need to evaluate 1 or 2 comparisons to compare the a values, plus 1 or 2 comparisons to compare the b values, for a total of up to 4 comparisons. You can write code that performs them more tersely, but it's going to be essentially equivalent to what you've already written.

Here's one possible slightly tricky solution:

int cmp(T t1, T t2) {
    return ((t2.a > t1.a) - (t2.a < t1.a)) || ((t2.b > t1.b) - (t2.b < t1.b));
}

I'd split it up like this:

int cmp(T t1, T t2) {
    return ((t2.a > t1.a) - (t2.a < t1.a))
           ||
           ((t2.b > t1.b) - (t2.b < t1.b));
}

The first half of the expression yields 0 if t1.a and t2.a are equal, -1 if t1.a < t2.a, and +1 if t1.a > t2.a. It depends on the fact that the relational operators always return either 0 or 1.

If the first half is either -1 or +1, the || short-circuits, and we're done; otherwise it goes on to compare the t1.b vs. t2.b.

This might actually be slightly less efficient than the code in your question since it always evaluates both t2.a > t1.a and t2.a < t1.a.

Incidentally, that's not a valid qsort comparison function. Such a function must take two const void* arguments. It can be written like this:

int cmp(const void *arg1, const void *arg2) {
    const T *t1 = arg1;
    const T *t2 = arg2;
    return ((t2->a > t1->a) - (t2->a < t1->a))
           ||
           ((t2->b > t1->b) - (t2->b < t1->b));
}

correct me if I'm wrong but even if it's written without any `if` compiler still generates `jg`, `jl` and (possibly) `je` instructions to evaluate `t2->a > t1->a`. **They're branches** (even if less obvious) but to evaluate full expression there are 2/3 to 4/5 branches. — Adriano Repetti, Jun 10 '15 at 06:42
"so a single comparison can't do the job" While this is true, it still does not prove there is no way without (or with less) comparisons. Arithmetic operations are able to produce more than two results, and it is possible some clever combination of arithmetic operations could create just the results needed. — Suma, Jun 10 '15 at 07:36
@KeithThompson I have to correct myself: even if they're branches compiler is smart enough to remove them. I guess because of boolean subtraction it generates `condition ? ~0 : 0` (it uses `SETLE` and `SETGE`). Even OP's code will generate `CMOV`without `jmp`... — Adriano Repetti, Jun 10 '15 at 13:40
@AdrianoRepetti: You're making some assumptions about the target system. The `SETLE`, `SETGE`, and `CMOV` instructions might not even exist. — Keith Thompson, Jun 10 '15 at 14:42
@KeithThompson of course. I assumed compiler won't use them in this case (even when supported). Obviously if they're not available then _this_ method will still generate many jumps (see disassembly in my deleted answer). — Adriano Repetti, Jun 10 '15 at 14:59

Suma · Answer 3 · 2015-06-10T21:22:13.390

1

Assuming a restricted range of input values, a in range INT_MIN/2 .. INT_MAX/2, b in range 0 .. UINT_MAX/2, and assuming 2nd complement integer arithmetics, you can implement the compare function with one branch only:

int cmp(T t1, T t2)
{
   // Decreasing order in "a"
   int d = t2.a - t1.a;
   if (d) return d;

   // Increasing order in "b"
   return (int)(t1.b - t2.b);
}

Visual Studio 2013 disassembly:

  int d = t2.a - t1.a;
00FC1000  mov         eax,dword ptr [esp+0Ch]  
00FC1004  sub         eax,dword ptr [esp+4]  
  if (d) return d;
00FC1008  jne         cmp+12h (0FC1012h)  

  // Increasing order in "b"
  return (int)(t1.b - t2.b);
00FC100A  mov         eax,dword ptr [esp+8]  
00FC100E  sub         eax,dword ptr [esp+10h]  
}
00FC1012  ret

edited Jun 10 '15 at 21:22

answered Jun 10 '15 at 07:55

Suma

33,181
16
123
191

@chux Which is why I explicitly wrote assumptions about restricted range of input values, so that there is no overflow possible. – Suma Jun 10 '15 at 18:29
I see that qualification now - making this answer OK within that restriction. – chux - Reinstate Monica Jun 10 '15 at 19:10

Adriano Repetti · Answer 4 · 2015-06-29T11:35:00.567

When you may ignore this answer: all reasoning about branching is useless if compiler will generate branchless code both for Keit's answer and even for original OP's code (Keit's one is treated as condition ? ~0 : 0 and OP's one will generate CMOV).

Of course you may target a CPU without SETcc and CMOVcc instructions. In this case yes, I'd avoid branches (if possible) using subtraction (doing a small performance test to determine what is faster between long long and double). If you other preconditions and range limitation isn't an issue you may even go with plain integer math.

When you shouldn't ignore this answer: if your target CPU has not CMOVcc and/or SETcc (or equivalent) instructions.

You don't need to return exactly +1 and -1, any positive or negative value works well (assuming you want to optimize this function to reduce jumps, math operations are relatively cheap). If we can make assumptions about platform specific signed integers implementation (2's complement) and unsigned/signed conversion then first step to remove branches (introducing cheap subtractions) is:

int cmp(T t1, T t2) {
    if (t2.a != t1.a)
        return t2.a - t1.a;

    if (t1.b < t2.b)
        return -1;

    return (int)(t1.b - t2.b);
}

To remove 2nd branch we can rely on a well-defined behavior of unsigned (not signed) integers math: t1.b - t2.b will wrap (when t1.b is smaller than t2.b) then (int)(t1.b - t2.b) will be a negative number and 2nd if may be omitted. With that assumption code can be:

int cmp(T t1, T t2) {
    if (t2.a != t1.a)
        return t2.a - t1.a;

    return (int)(t1.b - t2.b);
}

Note 1: 2nd optimization works just in your case because you're ordering descending for T.b then it's not a general rule.

Note 2: here you're working with copied structures. Compiler may optimize your code to remove T copies but it's not required to do it then IMO you should check generated code or use pointers T* for cmp arguments (if possible). Expansive copies will vanish any other micro-optimization we may do here.

Explanation

I see some explanation is needed, if we're trying to reduce (to avoid AFAIK is impossible) branches then we have to consider both visible and invisible ones (otherwise no reason to even start this possibly micro-optimization).

Branches
Every condition (like t2->b > t1->b) is compiled with branches. Let me pick nice peace of code from Keith's answer:

((t2.a > t1.a) - (t2.a < t1.a))
||
((t2.b > t1.b) - (t2.b < t1.b))

For t2.a > t1.a compiler will produce this:

008413FE  mov  eax,dword ptr [t2]     ; Load t2.a in EAX
00841401  cmp  eax,dword ptr [t1]     ; Compare EAX with t1.a
00841404  jle  cmp+32h (0841412h)     ; Go to set result to not true
00841406  mov  dword ptr [ebp-0C4h],1 ; Result for t2.a > t1.a is 1 (true)
00841410  jmp  cmp+3Ch (084141Ch)     ; Go to perform t2.a < t1.a
00841412  mov  dword ptr [ebp-0C4h],0 ; Result for t2.a > t1.a is 0 (false)

Similar code is produced for 2nd part t2.a < t1.a. Same code is then repeated for right side of || ((t2.b > t1.b) - (t2.b < t1.b)). Let's count branches: fastest code path has five branches (jle, jmp in first part, jge, jmp in second part) for each sub-expression plus an extra jump for short-circuit of || (for a total of six branches). Slowest one has even few more. It's worse than original implementation with many ifs.

For comparison let's see what is generate for comparison with subtraction:

; if (t2.a != t1.a)
00F313FE  mov  eax,dword ptr [t2] ; Load t2.a
00F31401  cmp  eax,dword ptr [t1] ; Compare with t1.a
00F31404  je   cmp+2Eh (0F3140Eh) ; If they are equal then go work with T.b

; return t2.a - t1.a;
00F31406  mov  eax,dword ptr [t2]  ; Load t2.a
00F31409  sub  eax,dword ptr [t1]  ; Subtract t1.a
00F3140C  jmp  cmp+34h (0F31414h)  ; Finished

This is our best code path, just two branches. Let's see 2nd part:

; return (int)(t1.b - t2.b);
00F3140E  mov  eax,dword ptr [ebp+0Ch] ; Load t1.b
00F31411  sub  eax,dword ptr [ebp+14h] ; Subtract t2.b

No more branches here. Our fastest and slowest code paths always have same number of branches.

Subtractions
Why subtractions work? Let's see with simple values and some edge cases Suma picked in comments. Let's say:

t1.a = 1;
t2.a = 10;

t1.b = 10;
t2.b = 1;

Then we have:

t2.a - t1.a == 10 - 1 == 9. Positive number as required in original code (if (t1.a < t2.a) return +1;). Opposite case is trivial. Here we're assuming signed integer math will wrap.

(int)(t1.b - t2.b) == 10 - 1 == 9. Positive number as required (inverse ordering for T.a and T.b). This needs more explanation because of edge cases. Imagine t1.b is UINT_MAX and t2.b is 0. t1.b - t2.b is still UINT_MAX and it has to be casted to int, it's bit pattern is 0xFFFFFFFF, exactly bit pattern of -1 for a signed int. Result is again correct. Note that here we're assuming two important things: signed numbers are represented in 2's complement and unsigned to signed conversion simply reinterpret raw memory value with new given type (no explicit calculation is done).

As noted by Suma problems arise when numbers are big, if you want full int and unsigned int range then you may simply cast them to double:

int cmp(T t1, T t2) {
    if (t2.a != t1.a)
        return (int)((double)t2.a - t1.a);

    return (int)((double)t1.b - t2.b);
}

Extract of generated assembly code:

; return (double)t2.a - (double)t1.a;
01361926  cvtsi2sd  xmm0,dword ptr [t2]  ; Load t2.a
0136192B  cvtsi2sd  xmm1,dword ptr [t1]  ; Load t1.a
01361930  subsd     xmm0,xmm1            ; Subtract t1.a to t2.a
01361934  cvttsd2si eax,xmm0             ; Convert back
01361938  jmp       cmp+88h (01361988h)

In this way the only tuple you can't use is INT_MIN for t1.a together with INT_MAX for t2.a.

Are you sure this works? The "b" field is an unsigned integer. — hugomg, Jun 09 '15 at 15:28
Counterexample for the second version: t1.b = UINT_MAX, t2.b = 0 — Suma, Jun 09 '15 at 16:26
Even signed int comparison does not seem to work for whole range of int in the second version: assume t2.a = INT_MAX and t2.b = INT_MIN. — Suma, Jun 09 '15 at 16:28
I admit I do not get it. If `1 - 0 > 0` is OK, how can `INT_MAX - INT_MIN < 0` be OK? If `1 - 0 > 0` is OK, how can `(int)(UINT_MAX - 0) < 0` be OK? Can you explain? — Suma, Jun 10 '15 at 07:25
I still do not get it. 127 -- 128 = 255 == -1, while 127 > -128. On the other hand, 1 - 0 = +1, while 1 >0. You get different sign for the same ordering. One of them must be wrong. — Suma, Jun 10 '15 at 07:58
Of course we're assuming signed integers are represented in 2's complement. It may seems an obvious implementation but it's not mandated by C standard and not used in every architecture (see also [this post](http://stackoverflow.com/a/23264385/1207195)). — Adriano Repetti, Jun 10 '15 at 08:01
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/80144/discussion-between-suma-and-adriano-repetti). — Suma, Jun 10 '15 at 08:30
I believe `return (int)(t1.b - t2.b);` will only work if the difference between the two numbers is less than `1<<31`. Presuming that `t1.b` is `0xFFFFFFFF` and `t2.b` is `0`, the correct result is supposed to be positive, but `(int)(0xFFFFFFFF - 0)` will evaluate to `-1`. — vgru, Mar 29 '19 at 09:45
@groo yes, for "big enough" numbers it doesn't work (see the very last paragraph) — Adriano Repetti, Mar 29 '19 at 10:18

score 0 · Answer 5 · answered Jun 10 '15 at 16:57

This does not reduce the number of conditions compiled, however it does reduce the number of conditions executed:

if(t1.a != t2.a)
    return t1.a < t2.a ? -1 : 1;
if(t1.b != t2.b)
    return t1.b < t2.b ? -1 : 1;
return 0;

If the two a members are equal, no more comparision is done on them. For N-field sorting you would do maximum N+1 comparisions, compared to 2N comparisions for original code.

Is there a way to code a qsort comparison function for unsigned integers without using lots of branching statements?

5 Answers5

Explanation