Use a temporary register (other than EFLAGS) to make this lower latency on CPUs without single-cycle `adc`:
```
mov   ecx, eax
bswap eax                  ; bit 0 -> bit 24, bit 31 -> bit 7
shl   eax, 7               ; top bit in place
shr   ax, 7+7              ; bottom bit in place (without disturbing top bit)
and   ecx, 0x7ffffffe      ; could optimize mov+and with BMI1 andn
and   eax, 0x80000001
or    eax, ecx             ; merge the non-moving bits with the swapped bits
```
On Intel CPUs before Sandybridge, `shr ax` and then reading EAX will suck (partial-register stall).

This looks like 5 cycle latency from input to output, same as the `adc`/`adc` version of @Fuz's on CPUs where that's single-cycle latency (AMD, and Intel since Broadwell). But on Haswell and earlier, this may be better.
We could save the `mov` using either BMI1 `andn` with a constant in a register, or maybe BMI2 `rorx ecx, eax, 16` to copy-and-swap instead of doing `bswap` in place. But then the bits are in less convenient places.
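For instance, a minimal sketch of the `andn` variant (the mask register, EDX here, is my arbitrary choice; the mov-immediate could be hoisted out of a loop):

```
mov   edx, 0x80000001      ; mask of the two bits being swapped (hoistable)

andn  ecx, edx, eax        ; ecx = eax & ~0x80000001, replacing mov+and
bswap eax
shl   eax, 7               ; top bit in place
shr   ax, 7+7              ; bottom bit in place
and   eax, 0x80000001
or    eax, ecx             ; merge the non-moving bits with the swapped bits
```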
@rkhb's idea to check if the bits differ and flip them is good, especially using PF to check for 0 or 2 bits set vs. 1. PF is only set based on the low byte of a result, so we can't just `and 0x80000001` without rotating first.
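A branchy sketch of that idea (my own illustration using `rorx`; not necessarily @rkhb's exact code):

```
rorx  ecx, eax, 31        ; rotate left by 1: low 2 bits are the two bits we care about
test  cl, 3               ; PF=1 iff they're equal (0 or 2 bits set)
jpe   .no_swap            ; equal: nothing to do
xor   eax, 0x80000001     ; differ: flip both bits to swap them
.no_swap:
```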
You can do this branchlessly with `cmov`:

```
; untested, but I think I have the parity correct
rorx   ecx, eax, 31      ; ecx = rotate left by 1.  low 2 bits are the ones we want
xor    edx, edx
test   cl, 3             ; sets PF=1 iff they're the same: even parity
mov    ecx, 0x80000001
cmovpo edx, ecx          ; edx=0 if bits match, 0x80000001 if they need swapping
xor    eax, edx
```
With single-uop `cmov` (Broadwell and later, or AMD), this is 4 cycle latency. The xor-zeroing and mov-immediate are off the critical path. The mov-immediate can be hoisted out of a loop if you use a register other than ECX.
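For example, with the constant hoisted (ESI is my arbitrary choice of spare register), only the `rorx`/`test`/`cmovpo`/`xor` chain stays in the loop body:

```
mov    esi, 0x80000001   ; loop-invariant mask, set up once before the loop

; inside the loop, per value in EAX:
rorx   ecx, eax, 31
xor    edx, edx
test   cl, 3
cmovpo edx, esi          ; no mov-immediate inside the loop
xor    eax, edx
```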
Or with `setcc`, but it's worse (more uops), or tied on CPUs with 2-uop `cmov`:
```
; untested, I might have the parity reversed.
rorx  ecx, eax, 31       ; ecx = rotate left by 1.  low 2 bits are the ones we want
xor   edx, edx
and   ecx, 3             ; sets PF=1 iff they're the same: even parity
setpe dl                 ; dl=1 if the two bits are equal
dec   edx                ; 0 or -1
and   edx, 0x80000001    ; 0, or the mask of the two bits
xor   eax, edx
```