
I want to do some long integer math (128 bit) with Intel I64 Assembler and need to create a 2's complement. Let's say that my positive value is in RDX:RAX.

2's complement is done by "flip the bits and add 1". So the most naive implementation is (4 instructions and 14 bytes of code):

  NOT RAX
  NOT RDX
  ADD RAX,1   ; Can't use INC, it doesn't set Carry
  ADC RDX,0

When I use the NEG instruction on RAX instead of NOT, it does the "+1" for me, but the Carry comes out wrong: NEG RAX clears Carry when RAX was zero, yet that is exactly the case where I need Carry set. So the next best approach might be (4 instructions and 11 bytes of code):

  NOT RDX
  NEG RAX
  CMC
  ADC RDX,0                  ; fixed, thanks lurker

Still 4 instructions. But instead of adding +1, I can subtract -1, and since SBB adds the Carry bit to the subtrahend, the subtraction adds +1 exactly when Carry is clear. So my next best try is this, with 3 instructions and 10 bytes of code:

   NOT RDX
   NEG RAX
   SBB RDX,-1

As you can see from my long winded text, this is not obvious to understand. Is there a better, more understandable way to do a cascaded 2's complement in assembler?

Rolf
  • You seem to assume that "better" equals "shorter code", and that doesn't have to hold on an out-of-order superscalar processor like the x86-64. I'd say the most understandable of your implementations is the first one, and I wouldn't be surprised if all of them take the same time to execute. – mcleod_ideafix Apr 03 '15 at 20:08
  • BTW: have you considered using the XMM registers? They are wide enough to hold a 128 bit number and (I haven't checked) they might have integer instructions to deal with the whole number – mcleod_ideafix Apr 03 '15 at 20:10
  • @mcleod_ideafix they don't, so you're still left with the problem of carrying the carry manually. – harold Apr 03 '15 at 20:49
  • I think you mean `ADC RDX,0` as the last instruction in your second example. – lurker Apr 03 '15 at 22:02
  • Thanks @lurker, I fixed it. And yes, I considered XMM registers. They are made for vectors of integers where carry propagation is not an option. So they give you the choice between undetected overflow or integer saturation. Not good for my purpose. – Rolf Apr 03 '15 at 22:51
  • It DID cross my mind that my problem might be irrelevant because of the optimizing hardware. But I'm a guy born in the 1960s and grew up counting bytes and cycles. It became a habit, I think :) – Rolf Apr 03 '15 at 22:54
  • Looking at Agner Fog's instruction tables, surprisingly `CMC` is low latency, high throughput on modern microarchitectures, so the second version is probably competitive. On the other hand, consider the dependency chains: 1: `ADC` depends on `NOT`/`ADD` via flags, depends on `NOT`. 2: `ADC` depends on `NOT`/`CMC` depends on `NEG`. 3: `SBB` depends on `NOT`/`NEG`. I'd say you've found a very clever way with the last version. – EOF Apr 03 '15 at 22:57
  • SIMD registers like XMM/YMM/ZMM are meant for multiple values at the same time, [not for multiprecision integers](http://stackoverflow.com/questions/12200698/is-it-possible-to-use-sse-v2-to-make-a-128-bit-wide-integer) – phuclv Apr 04 '15 at 03:04
  • Thanks EOF for the hint to Agner Fog. I will look into this. – Rolf Apr 05 '15 at 17:51

2 Answers


Shorter instructions or fewer instructions don't necessarily mean faster execution, because the latency and throughput of each instruction differ.

For example, obsolete instructions like enter, dad, loop... are very slow and exist only for backward compatibility. Even inc is sometimes slower than add. The same goes for the cmc you used above, on some μarchs.

As a result, a longer series of low-latency instructions that can execute in parallel will often run faster. Some common instruction pairs can even be fused into a single macro-op. Compiler optimizers know this and choose the most appropriate instructions to emit.

For this snippet

__int128 negate(__int128 x)
{
    return -x;
}

ICC 19.0.1 will generate the following instructions

    xor       edx, edx                                      #3.13
    xor       eax, eax                                      #3.13
    sub       rax, rdi                                      #3.13
    sbb       rdx, rsi                                      #3.13

The first two xor instructions cost zero μops, because they're handled at the register-rename stage. So only 2 instructions are left to execute.

You can switch compilers in the Godbolt link above to see the various ways different compilers negate, including MSVC (which unfortunately doesn't have a 128-bit type yet). Below are the results for GCC and Clang.

GCC 8.3:

    mov     rax, rdi
    neg     rax
    mov     rdx, rsi
    adc     rdx, 0
    neg     rdx

Clang:

    mov     rax, rdi
    xor     edx, edx
    neg     rax
    sbb     rdx, rsi

As you can see, Clang also uses only 3 instructions (not counting the first one, which copies the input argument to the required destination). And like xor reg, reg, mov can also be "free", eliminated at register rename.

Things may be different if you optimize for space (e.g. when cache misses are costly), because some immediates and instructions are long.

Whether it's actually faster needs microbenchmarking. But on Intel CPUs, the Intel compiler (ICC) often achieves higher performance than others because it understands the architecture better.

Note that this operation is called negation; two's complement is the way negative numbers are encoded, not the operation itself.

phuclv
  • `dad` is not an x86 mnemonic, `aam` and the like are invalid in x86-64. According to Agner Fog's instruction tables, `cmc` is fast on every Intel/AMD microarchitecture since the P4. – EOF Apr 04 '15 at 09:41
  • Perhaps I should try GCC sometimes, at least to look at the code it generates. I was using Visual Studio 2013. – Rolf Apr 05 '15 at 17:50
  • https://gcc.godbolt.org/ will help you see the assembly output of several of the most common compilers – phuclv Apr 07 '15 at 03:30

BTW, negating a 2-register number is the same in 32-bit or 16-bit mode, with EDX:EAX or DX:AX. Use the same instruction sequences.


To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.

That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (can happen before the data to be negated is ready), so this has a total latency of 2 or 3 cycles for the high half. The low half is of course ready with 1 cycle latency.

Clang's mov/neg for the low half is maybe better on Ryzen, which has mov-elimination for GP-integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts a mov on the critical path for latency. Usually back-end ALU pressure is less of a problem than front-end bottlenecks, for instructions that can use any ALU port.


To negate in-place, use neg to subtract from 0

neg   rdx              ; high half first
neg   rax             ; subtract RDX:RAX from 0
sbb   rdx, 0          ; with carry from low to high half

neg is exactly equivalent to sub from 0, as far as setting flags and performance.

ADC/SBB with an immediate 0 is only 1 uop on Intel SnB/IvB/Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, mov to another register and then sbb back into RDX would be slower.

The low half (in RAX) is ready in the first cycle after it's ready as an input to neg. (So out-of-order execution of later code can get started using the low half.)

The high half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for rdx from neg rdx and CF from neg rax. So it's ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.


The above sequence is better than any in the question, being fewer uops on very common Intel CPUs. On Broadwell and later (single-uop SBB, not just for immediate 0), this variant is equally good:

;; equally good on Broadwell/Skylake, and AMD.  But worse on Intel SnB through HSW
NOT RDX
NEG RAX
SBB RDX,-1     ; can't use the imm=0 special case

Any of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, like 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.

Peter Cordes