BTW, negating a 2-register number is the same in 32-bit or 16-bit mode, with EDX:EAX or DX:AX. Use the same instruction sequences.
To copy-and-negate, @phuclv's answer shows efficient compiler output. The best bet is xor-zeroing the destination and then using sub / sbb.
That's 4 uops for the front-end on AMD, and on Intel Broadwell and later. On Intel before Broadwell, sbb reg,reg is 2 uops. The xor-zeroing is off the critical path (it can happen before the data to be negated is ready), so this has a total latency of 2 or 3 cycles for the high half. The low half is of course ready with 1 cycle latency.
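To make the carry propagation in the xor-zero / sub / sbb approach concrete, here is a minimal C sketch of the arithmetic. It models the flags by hand and is not the compiler output from @phuclv's answer. The type u128_halves and the function negate_copy are hypothetical names made up for this illustration.

```c
#include <stdint.h>

/* Hypothetical helper type: a 128-bit value held as two 64-bit halves,
   like a RDX:RAX register pair. */
typedef struct { uint64_t lo, hi; } u128_halves;

/* Copy-and-negate: dst = 0 - src, modelling xor-zero / sub / sbb.
   Unsigned arithmetic wraps modulo 2^64, like the hardware. */
static u128_halves negate_copy(u128_halves src)
{
    u128_halves dst = { 0, 0 };           /* xor-zero both destination halves */
    unsigned borrow = (src.lo != 0);      /* CF after sub: 0 - lo borrows iff lo != 0 */
    dst.lo = dst.lo - src.lo;             /* sub dst_lo, src_lo */
    dst.hi = dst.hi - src.hi - borrow;    /* sbb dst_hi, src_hi */
    return dst;
}
```

The source halves are only read, never written, which is why the zeroing can execute before the input is ready.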
Clang's mov/neg for the low half is maybe better on Ryzen, which has mov-elimination for GP integer registers but still needs an ALU execution unit for xor-zeroing. On older CPUs, though, it puts a mov on the critical path for latency. That said, back-end ALU pressure is usually not as big a deal as front-end bottlenecks, for instructions that can use any ALU port.
To negate in-place, use neg to subtract from 0:
neg rdx ; high half first
neg rax ; subtract RDX:RAX from 0
sbb rdx, 0 ; with carry from low to high half
neg is exactly equivalent to sub from 0, as far as setting flags and performance.
ADC/SBB with an immediate 0 is only 1 uop on Intel SnB/IvB/Haswell, as a special case. It's still 2 uops on Nehalem and earlier, though. But without mov-elimination, mov to another register and then sbb back into RDX would be slower.
The low half (in RAX) is ready 1 cycle after its input to neg is ready. (So out-of-order execution of later code can get started using the low half.)
The high half neg rdx can run in parallel with the low half. Then sbb rdx,0 has to wait for rdx from neg rdx and CF from neg rax. So it's ready at the later of 1 cycle after the low half, or 2 cycles after the input high half is ready.
The above sequence is better than any in the question, being fewer uops on very common Intel CPUs. On Broadwell and later (where SBB is single-uop for any operand, not just for immediate 0), this alternative is just as good:
;; equally good on Broadwell/Skylake, and AMD. But worse on Intel SnB through HSW
NOT RDX
NEG RAX
SBB RDX,-1 ; can't use the imm=0 special case
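As a sanity check that neg / neg / sbb 0 and not / neg / sbb -1 compute the same thing, here is a small C model of the two in-place sequences. It is only a sketch of the arithmetic with CF tracked by hand; the type and function names (u128_halves, negate_neg_sbb0, negate_not_sbbm1) are hypothetical and don't come from the question or the other answers.

```c
#include <stdint.h>

/* Hypothetical helper type: RDX:RAX as two 64-bit halves. */
typedef struct { uint64_t lo, hi; } u128_halves;

/* Models: neg rdx ; neg rax ; sbb rdx, 0 */
static u128_halves negate_neg_sbb0(u128_halves v)
{
    uint64_t hi = 0 - v.hi;           /* neg rdx */
    unsigned cf = (v.lo != 0);        /* neg rax sets CF unless RAX was 0 */
    uint64_t lo = 0 - v.lo;           /* neg rax */
    hi = hi - 0 - cf;                 /* sbb rdx, 0 */
    return (u128_halves){ lo, hi };
}

/* Models: not rdx ; neg rax ; sbb rdx, -1 */
static u128_halves negate_not_sbbm1(u128_halves v)
{
    uint64_t hi = ~v.hi;              /* not rdx (leaves flags alone) */
    unsigned cf = (v.lo != 0);        /* CF from neg rax */
    uint64_t lo = 0 - v.lo;           /* neg rax */
    hi = hi - UINT64_MAX - cf;        /* sbb rdx, -1: ~hi_in + 1 - CF, mod 2^64 */
    return (u128_halves){ lo, hi };
}
```

Both reduce to hi = -hi_in - CF because ~x + 1 equals -x in two's complement, so the choice between them is purely about uop counts, not results.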
Any of the 4-instruction sequences are obviously sub-optimal, being more total uops. And some of them have worse ILP / dependency chains / latency, like 2 instructions on the critical path for the low half, or a 3-cycle chain for the high half.