
I want to get the most speed-optimal code out of this frequently run comparison:

```cpp
int Object::Compare(Object *other) {
    if(id < other->id) return -1;
    if(id > other->id) return 1;
    return 0;
}
```

The `id` member is an unsigned 32-bit integer.

What I'm finding is that even with speed optimizations turned on via a pragma (ordinarily the debug build doesn't use optimizations), the generated assembly reads the `id` values twice and compares twice, even though the comparison only needs to happen once.

Variations on this code have not helped. In this specific case I'm compiling in Visual C++ and looking at its output, although the code also has to compile for Linux (still x86). Is there a better way to structure this so the compiler understands it only needs to compare once and can then use the comparison flags to handle the rest?

Even better would be if I could get non-branching instructions like `cmov` into play. Any thoughts?
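
The question never says which pragma is being used (a point raised in the comments below). Assuming it is MSVC's `#pragma optimize`, a minimal sketch of that form would look like this; note the later comment reporting that the pragma turned out to be useless, which is consistent with it not raising the optimization level in a `/Od` debug build:

```cpp
#include <cstdint>

struct Object {
    uint32_t id;
    int Compare(Object *other);
};

// MSVC-specific: request speed optimization ("t") for the functions that
// follow. In a /Od debug build this may not actually raise the optimization
// level, which would match the "pragma has been a complete failure" report
// in the comments below.
#pragma optimize("t", on)
int Object::Compare(Object *other) {
    if (id < other->id) return -1;
    if (id > other->id) return 1;
    return 0;
}
#pragma optimize("", on)  // restore the command-line optimization settings
```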

  • Use a compiler that doesn't suck. https://godbolt.org/z/dqjGGYs4r is a version simplified to take two `int *` args, to simulate your arg and `this`. clang and GCC only run `cmp` once; clang branches for `-1`, and then uses `setcc`. GCC uses setcc+cmov. Of course if you're branching on the result, and this can inline, i.e. it's not hidden behind virtual function dispatch, then have a look at the callsite, not a stand-alone version of this function. (Optimizations need to be on, obviously.) – Peter Cordes Apr 18 '21 at 06:40
  • That's pretty straightforward code; besides adding `const noexcept`, I'd leave it as it is. What `pragma` are you using? – MatG Apr 18 '21 at 06:43
  • MSVC's code-gen loads once. It repeats the `cmp` but that lets it use `xor`-zeroing to zero-extend the setcc result. Otherwise basically the same strategy as clang's: https://godbolt.org/z/s8cs59h78. Are you sure you enabled optimization properly for MSVC? – Peter Cordes Apr 18 '21 at 06:44
  • It would help if you provided a [mcve] - a small but complete sample of code that presents your concern, including description of the compiler options. With a disconnected fragment like you have supplied, there are all sorts of possibilities (e.g. `id` is a `volatile` member, the `id` member is a type with overloaded comparison operators or volatile members, etc etc). – Peter Apr 18 '21 at 07:01
  • "*with speed optimizations turned on via a pragma*" - what pragma? – rustyx Apr 18 '21 at 07:18
  • How about `return (other->id < id) - (id < other->id);`? – Scheff's Cat Apr 18 '21 at 07:20 (this variant is sketched in code after the comment thread)
  • What’s wrong with returning the difference? – zdf Apr 18 '21 at 07:24
  • @zdf: Because that wouldn't have the compare-result info if the subtraction wraps. It takes up to N+1 bits to hold the full result of an N-bit addition or subtraction. That's why (in x86 asm) the unsigned below condition is CF == 1 (carry-out = borrow), not SF == 1 (MSB). You *could* return `id - int64_t(other->id)` as a signed integer, if the inputs are actually unsigned *int*, i.e. a narrower type. That would be efficient on a 64-bit system, since you just zero-extend both inputs (usually free after inlining) and subtract. – Peter Cordes Apr 18 '21 at 09:07
  • @PeterCordes I assumed a cast to a larger signed integer; sorry. [`return int64_t(a) - int64_t(b)`](https://godbolt.org/z/e46Ersaej). – zdf Apr 18 '21 at 10:38
  • @zdf: Ok yes, that works. But you forgot to enable optimization, and you defeated the purpose by casting back to `int`. You have to actually return `int64_t`. https://godbolt.org/z/ja3Yecac6. Enabling optimization makes the problem obvious in the asm: still only a 32-bit subtraction with your original. I included a fixed version that is way more efficient than setcc/cmov. – Peter Cordes Apr 18 '21 at 10:49
  • @PeterCordes The code in question is 32-bit. – Lummox JR Apr 18 '21 at 20:55
  • Just wanted to add that apparently the pragma is useless. The only way I can really tell what's going on is to get ASM output from the release code generation. I hate having to do that because it's much harder to see changes, but the pragma has been a complete failure. – Lummox JR Apr 18 '21 at 21:27
  • I've included the suggestion by @Scheff in this test: https://godbolt.org/z/qohqEffhT. That version is branchless, but is it in fact faster? How would I be able to test that effectively? – Lummox JR Apr 18 '21 at 21:33 (a microbenchmark sketch follows this thread)
  • @LummoxJR: Oh, you're using the Windows meaning of "x86", where you specifically mean obsolete 32-bit mode. Not the CPU-architecture meaning where it means x86 including extensions like x86-64, as opposed to totally different ISAs like AArch64. – Peter Cordes Apr 18 '21 at 21:34
  • @LummoxJR: [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391). MSVC's version of materializing both the booleans with `cmp`/`sbb`/`neg` (at least 3 cycle dependency chain) and then subtracting them doesn't look very good compared to GCC's `setcc` / `cmov`. SBB and CMOV are both 2-uop instructions on Intel before Broadwell, but MSVC's uses SBB (and everything else) twice, taking more instructions for front-end throughput. And I think worse critical-path latency. – Peter Cordes Apr 18 '21 at 21:39
  • But it remains to be seen how this would optimize into a caller that did `if (px->Compare(py) >= 0) swap(...)` or something; you want something that can inline and reduce to a `cmp / jae`. If that's your normal use-case, then the stand-alone version that actually creates a boolean is less important. – Peter Cordes Apr 18 '21 at 21:41
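
To pull the thread's suggestions together, here is a hedged sketch of the two branchless variants discussed above (Scheff's Cat's subtract-two-booleans idiom, and the widening-subtraction idea from zdf and Peter Cordes), assuming `id` is a plain non-volatile `uint32_t` as the question states:

```cpp
#include <cstdint>

struct Object {
    uint32_t id;

    // Scheff's Cat's idiom: two flag-producing compares, no branches.
    // Compilers generally turn each `<` into cmp + setcc (or sbb), so this
    // yields -1, 0, or 1 without any conditional jumps.
    int Compare(const Object *other) const {
        return (other->id < id) - (id < other->id);
    }

    // The widening-subtraction idea from the comments: zero-extend both
    // 32-bit ids into int64_t so the difference cannot wrap, then return
    // the full 64-bit result. Callers must test only the sign / zero-ness
    // of the return value, never compare it to -1 or 1.
    int64_t Compare64(const Object *other) const {
        return int64_t(id) - int64_t(other->id);
    }
};
```

Per the discussion above, the 64-bit subtraction is essentially free on a 64-bit target but less attractive for the 32-bit build in question, and whichever form wins as a stand-alone function may stop mattering once the call inlines into a branching caller.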

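On the "how would I test that effectively" question, a minimal microbenchmark sketch; everything here (names, sizes, the reduction loop) is illustrative rather than from the thread, and branch predictability of the input data will dominate the result, so it is worth running with both random and sorted inputs:

```cpp
#include <chrono>
#include <cstdint>
#include <cstdio>
#include <random>
#include <vector>

// Branchy and branchless comparators over plain ids.
static int cmp_branchy(uint32_t a, uint32_t b) {
    if (a < b) return -1;
    if (a > b) return 1;
    return 0;
}
static int cmp_branchless(uint32_t a, uint32_t b) {
    return (b < a) - (a < b);
}

// Time a reduction over all adjacent comparisons of `v`. The accumulated
// sum is returned and printed so the work cannot be optimized away.
template <typename F>
static long long bench(const std::vector<uint32_t>& v, F cmp, double* ms) {
    auto t0 = std::chrono::steady_clock::now();
    long long acc = 0;
    for (size_t i = 1; i < v.size(); ++i)
        acc += cmp(v[i - 1], v[i]);
    auto t1 = std::chrono::steady_clock::now();
    *ms = std::chrono::duration<double, std::milli>(t1 - t0).count();
    return acc;
}

int main() {
    std::vector<uint32_t> v(1 << 24);
    std::mt19937 rng(42);
    for (auto& x : v) x = rng();  // random data: worst case for branches

    double ms;
    long long a = bench(v, cmp_branchy, &ms);
    std::printf("branchy:    %.2f ms (acc=%lld)\n", ms, a);
    long long b = bench(v, cmp_branchless, &ms);
    std::printf("branchless: %.2f ms (acc=%lld)\n", ms, b);
}
```

With random data the branchy comparator's jumps mispredict roughly half the time; re-run after sorting the vector and they become nearly free, which is exactly the point made above that the realistic callsite matters more than a stand-alone measurement.
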
0 Answers