1

I am using compiler explorer to look at some outputs from gcc and clang to get an idea of what assembly these compilers emit for some code. Recently I looked at the output of this code.

int compare_int64(int64_t left, int64_t right)
{
    return (left < right) ? -1 : (left > right) ? 1 : 0;
}

The point of this exercise is not for C++ where this code might be inlined anyway but when such functions are being called.

With -O3 this is the output:

clang:

xor     ecx, ecx
cmp     rdi, rsi
setg    cl
mov     eax, -1
cmovge  eax, ecx
ret

gcc:

xor     eax, eax
cmp     rdi, rsi
mov     edx, -1
setg    al
cmovl   eax, edx
ret

I noticed this code is 17 bytes in size which is just 1 byte over a nice 16 byte (the default code alignment for x64 in another non-C++ compiler I am using is 16). For the gcc code shown I was thinking of either using lea edx,[eax-1] or or edx,-1 (latter before the cmp of course) to reduce the code size. Interestingly when using -Os gcc inserts a jl instruction which is kinda disastrous for the performance of that function.

I am no expert and looked into the instruction tables manual by Agner Fog and if I am not mistaken for mov, lea and or the timings/latency are equal.

So the actual question(s): Why do both compilers use a 5byte size instruction instead of a shorter 3- or 4-byte instruction? Would it be harmless to replace the mov reg,-1 with lea reg,[eax-1] or or reg,-1?

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
Stefan Glienke
  • 20,860
  • 2
  • 48
  • 102
  • 3
    `or edx, -1` probably appears to have an input dependency on the previous contents of `edx`. Mathematically it doesn't, but the machine probably doesn't know that. (`xor edx, edx` would have the same problem, but it's special-cased.) – Nate Eldredge Apr 25 '21 at 20:59
  • 1
    Note `clang -Oz` gets it down to 3 bytes with `push -1 ; pop eax`. – Nate Eldredge Apr 25 '21 at 21:35
  • I tried the `push/pop` but that seems to be a little slower. – Stefan Glienke Apr 25 '21 at 21:54
  • "So the actual question(s): Why do both compilers use a 5byte size instruction instead of a shorter 3- or 4-byte instruction? Would it be harmless to replace the mov reg,-1 with lea reg,[eax-1] or or reg,-1?" Wouldn't you have buffer overflow if you use lesser size? you would be comparing 32 bit number with 64 bit number? –  Apr 25 '21 at 22:27
  • 1
    Eh? the question is about reducing the actual size in bytes of the assembly instruction needed to put a -1 into a register. – Stefan Glienke Apr 25 '21 at 22:32
  • 2
    Yes, of course `push`/`pop` is slower than `mov`, that's why clang only uses that at `-Oz`, extreme size optimization, even with serious performance penalties. Unfortunately clang `-Oz` isn't smarter about using `lea eax, [rcx - 1]` (3 bytes). Note the RCX, you don't want an address-size prefix. – Peter Cordes Apr 25 '21 at 23:47
  • @CEPB: No idea what you're talking about, that makes no sense. There is no buffer here. Perhaps you're missing [Why do x86-64 instructions on 32-bit registers zero the upper part of the full 64-bit register?](https://stackoverflow.com/q/11177137), and/or that the return value *is* only `int`, so the `cmp` needs 64-bit operand-size to set FLAGS according to the int64_t inputs, but everything else can use 32-bit operand-size (and 64-bit address-size) to avoid any REX.W or 66 / 67 prefix bytes. – Peter Cordes Apr 25 '21 at 23:50
  • lol sorry guys I am not assembly expert in fact almost newby, I understand whats going on but have no idea why are some things done the way they are done. But i was thinking that he is asking about 32 vs 64 bit data not instructions. –  Apr 25 '21 at 23:53
  • 1
    Thanks for mentioning the `lea eax, [rcx-1]` instead of `lea eax, [ecx-1]` for getting 3 bytes, I missed that! – Stefan Glienke Apr 26 '21 at 00:00

1 Answers1

4

When optimizing for speed mov reg, -1 is used instead of or reg, -1 because the former uses the register as a "write-only" operand, which CPU knows about and uses that to schedule it efficiently (out of order). Whereas or reg, -1, even though will always produce -1 is not recognized by the CPU as a dependency-breaking (write only) instruction.

To illustrate how it can affect performance:

mov eax, [rdi]  # Imagine a cache-miss here.
mov [rsi], eax
mov eax, -1     # `mov eax, -1` is able to dispatch and execute without waiting
                # for the cache-miss to be served.
add eax, edx    # `add eax, edx` only needs to wait 1 cycle for `mov` to
                # complete (assuming `edx` is ready) and then it can
                # dispatch while cache-miss load from a few lines above
                # is still in progress.

And now this code:

mov eax, [rdi]   # Imagine a cache-miss here.
mov [rsi], eax
or eax, -1       # Now this instruction has to wait for the cache-miss
                 # load to complete.
add eax, edx     # And this one will be waiting too.

(Example applies for any current x86-64 CPU, such as Skylake/Ice Lake/Zen).

If you're writing the code in assembly and certain that a register is not part of a dependency-chain that's currently in progress, you can use or reg, -1 and it'll have no negative effect (if your assumptions are right, of course).

Because of that danger of accidentally attaching to a dependency chain compilers do not generally use or reg, -1 for producing -1 when optimizing for speed.

When we need a zero, instead of -1, we're in luck because there are idioms that CPUs recognize, for example xor reg, reg and sub reg, reg. They're smaller in code size, and CPUs recognize that the result of the computation does not depend on the register (always zero).

These zero-idioms, on top of being smaller in code size, are also usually processed by the front-end part of the CPU, so the instructions depending on the result will immediately be able to dispatch.

Zero-idioms also work for vector registers: vpxor xmm0, xmm0, xmm0 (produce zero with no dependency on previous value of xmm0 and zero-latency). What is interesting is that vector registers also have a -1 idiom, which is vpcmpeqd xmm0, xmm0, xmm0 - this one is recognized as write-only (comparing the value with itself will always be true), but it still has to execute (so it has latency=1), at least on SKL/Zen CPUs.

More about producing zeros: What is the best way to set a register to zero in x86 assembly: xor, mov or and?

Specifically which idioms are recognized can be found in Agner Fog's manuals or CPU optimization guides. TLDR is general purpose registers only have zero-idioms, vector registers have zero-idioms and all-ones idioms.

See also: Set all bits in CPU register to 1 efficiently (mentions lea edx, [rax - 1]).


Note on the actual function. As you can see from assembly most of the work is actually trying to produce the specific constants that you've requested.

If all you intend to do with -1,0,1 is branch on whether it's negative/zero/positive, then it would be better to produce left - right (as long as you ensure there is no overflow because that would make the subtraction result alone insufficient for comparison - in that case just use -1, 0, 1) and then just branch/cmov on that.

stepan
  • 1,043
  • 2
  • 8
  • 12
  • Note that the latency=1 for `pcmpeqd xmm0,xmm0` is unlike normal instructions, since as you say it's dep-breaking (on almost all CPUs, IIRC not Silvermont). The back-end [schedules uops in oldest-ready-first order](https://stackoverflow.com/questions/40681331/how-are-x86-uops-scheduled-exactly), not trying to do critical-path analysis, so the all-ones result is ready 1 cycle after the back-end gets around to dispatching it to an execution port. – Peter Cordes Apr 26 '21 at 00:04
  • `left - right` has the signed-compare result in the combination of OF and SF, not just the MSB of the integer result, so it's not that simple. With 32-bit integer inputs, you could sign-extend them both to 64-bit and subtract. (And possibly right-shift and OR the high half into the low half to reduce back to input width? No, that would confuse the actual sign bit with the input-MSB position which we already know we can't use directly.) – Peter Cordes Apr 26 '21 at 00:09
  • @PeterCordes `left - right` I meant to use that result itself as return value in C++, instead of artificial constants -1, 0, 1. Unless they actually provide a benefit (having smaller size for storage for example) they should be avoided in my opinion. If the function is inlined the compiler is able to avoid producing these constants on its own https://godbolt.org/z/YWEYxx379. – stepan Apr 26 '21 at 00:43
  • 1
    Yeah, but `return left - right` loses info unless you sign-extend your inputs to a wider type before the subtract. That was the point of my comment. That's why `jge` has to be SF==OF, not just SF==0 (https://www.felixcloutier.com/x86/jcc). (Widening makes signed-overflow impossible, so you can just check the sign of the integer result. It also avoids UB in C). Note that if you `return left - right` and compile with `-fwrapv` to not ignore the possibility of signed-overflow, that requires clang to `sub` / `test reg,reg` after inlining. https://godbolt.org/z/Tb4dc1bfj – Peter Cordes Apr 26 '21 at 00:52
  • @PeterCordes ah, that makes sense yeah good point. I've edited the "note" to mention it needing to be the same bit width as the inputs. – stepan Apr 26 '21 at 06:59
  • Still nope, you need a *65* bit signed integer for its sign to accurately reflect the comparison result of two 64-bit integers. That's why I said it only works if you sign-extend both integer inputs (to a wider type) before subtracting. e.g. `int64_t` result from `int32_t` inputs. Equal width fails on signed overflow. (Which is incidentally UB in C, while `INT_MAX > -123` is well-defined.) – Peter Cordes Apr 26 '21 at 07:15
  • 1
    @PeterCordes I see what you're talking about now, thank you. This may be a good reason to use -1, 0, 1 then, to have it well defined for entire input range. – stepan Apr 26 '21 at 07:24
  • Yes, it's too bad there isn't a cheap single asm operation that maps 2 integers to a 3-way compare result (as a single integer that can be tested later, rather than in FLAGS), but AFAIK there isn't. It would need a new instruction like cmp3way reg, reg, r/m`, or `set3way r/m32` (please not r/m8) to materialize a -1 / 0 / 1 compare result from SF, OF, and ZF. (Existing C code will often return -1 / 0 / +1 so if you're introducing HW support, you want it to booleanize (trinarize?) to work as a peephole for that as well as - / 0 / + with don't-care magnitudes.) – Peter Cordes Apr 26 '21 at 07:31