1

I'm working my way through low level bit hacks, and would like to write an assembly program for each. Here is what I have for checking if a number is even or not:

is_even:
    # check if an integer is even. 
    # This is the same as seeing if its a multiple of two, i.e., & 1<<n - 1
    # rdi stores the number
    xor %eax, %eax
    test $0b1, %rdi
    setz %al
    ret

_start:
    mov $5, %rdi
    call is_even

Are there any ways to improve the above or make it more readable? Is it possible to do the is_even check with 2 instructions instead of 3, as the first xor and second setz seems like it might be converted into one if possible.

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
David542
  • 104,438
  • 178
  • 489
  • 842

2 Answers2

5

TL:DR: adding 1 flips the low bit, guaranteed, so you can use lea/and. See below.


You chose to write a whole function that returns a boolean integer instead of just creating a FLAGS condition (which is all most code needs: test $1, %dil and you're done; branch or cmov or setnz or setz or whatever you actually want to do based one the value being even).

If you are going to return an integer, then you don't actually need to get the condition into FLAGS and back out, especially if you want a "wide" return value. x86 setcc only writing the low byte is an inconvenient design that requires an extra xor-zeroing instruction most times you want to create a wider 0 / 1 integer. (I wish AMD64 had tidied up the design and changed the meaning of that opcode for 64-bit mode to setcc r/m32 but they didn't.)

You chose the semantics of your function to return 1 for even; that's opposite of the value of the low bit. (i.e. return (~x)&1;) You also chose to make a function using the x86-64 System V calling convention, imposing overhead from the calling convention taking an arg in a different register than you passed in.

This function is obviously way too trivial to be worth the call/return overhead; in real life you'd just inline and optimize this into the caller. So optimizing it as a stand-alone function is mostly a silly exercise, except for the idea of getting a 0/1 in a separate register from the original without destroying it.

If I was writing an answer on https://codegolf.stackexchange.com/, I'd follow this code-golf tip and choose my calling convention to pass an arg in EAX and return a boolean in AL (like gcc -m32 -mregparm=3 would). Or return a FLAGS condition in ZF. Or if allowed, choose my return semantics such that AL=0 meant even, AL=1 meant odd. Then

# gcc 32-bit regparm calling convention
is_even:          # input in RAX, bool return value in AL
    not   %eax             # 2 bytes
    and   $1, %al          # 2 bytes
    ret
# custom calling convention:
is_even:   # input in RDI
           # returns in ZF.  ZF=1 means even
    test  $1, %dil         # 4 bytes.  Would be 2 for AL, 3 for DL or CL (or BL)
    ret

2 instructions without destroying the input

is_even:
    lea   1(%rdi), %eax          # flip the low bit
    and   $1, %eax               # and isolate
    ret

XOR is add without carry. When the carry-in is zero (guaranteed for the low bit except with ADC), the result for a given bit is the same for XOR and addition. Check the truth table / gate-equivalent for a 1-bit "half adder" (no carry in): the "sum" output is literally just XOR, the carry output is just AND.

(XOR with 1 flips a bit, same as NOT.)

In this case, we don't care about the carry-out or any of the high bits (because we're about to nuke those bits with & 1 are the same operation), so we can use LEA as a copy-and-add to flip the low bit.

Using XOR instead of ADD or SUB is useful for SIMD, where pxor can run on more ports than paddb or psubb on CPUs before Skylake. When you want to range-shift unsigned to signed for pcmpgtb or something, you want to add -128, but that's the same thing as just flipping the high bit of each byte.


You could use this to flip a higher bit, e.g. lea 8(%rdi), %eax would flip the 1<<3 bit position (and potentially carry into all higher bits). We know the carry-in to that bit will be zero because x + 0 doesn't carry, and the low 3 bits of 8 are all 0.

(This idea is central to some of the later more interesting bit-hacks in https://catonmat.net/low-level-bit-hacks)

Peter Cordes
  • 328,167
  • 45
  • 605
  • 847
  • that is so neat, thanks for explaining the `lea` trick. From the other answer, why doesn't the compiler do it that way? https://godbolt.org/z/xPGYfb. – David542 Oct 01 '20 at 05:40
  • @David542: Apparently nobody's taught GCC or clang's optimizer to look for that possible optimization. It's only relevant for an x86 back-end (where LEA is special), not in the architecture-neutral part of the compiler where most optimizations happen. Most ISAs that can copy-and-add can also copy-and-xor or copy-and-not. You could file missed-optimization bugs for GCC and clang, if I don't get around to it. – Peter Cordes Oct 01 '20 at 05:48
2

I can't get it down to two instructions, but I can golf it a little shorter.

Your current version is 12 bytes including the ret. You can shave off two bytes with test $1, %dil instead, since the high bytes of the input are irrelevant, thus trading the 4-byte immediate for a 1-byte immediate and a prefix byte. That gets it down to 10.

You can shave off two more bytes by taking advantage of the somewhat obscure fact that the shift instructions shift into the carry flag, and doing

is_even: // 8 bytes
    xor %eax, %eax
    shr $1, %edi
    setnc %al
    ret

gcc and clang both do

is_even: // 8 bytes
    mov %edi, %eax
    not %eax
    and $1, %eax
    ret

For one less byte, there's

is_even: // 7 bytes
    shr $1, %edi
    sbb %eax, %eax
    inc %eax
    ret

sbb is "subtract with borrow", which subtracts one register from another, then subtracts 1 more if the carry flag was set. This leaves us with 0 if the input was even, and -1 if it was odd. Then adding 1 gets us where we want to be. This might be slower because I'm not sure if the CPU knows that the result does not depend on the previous value of %eax.

I don't see a way to get down to two instructions, though. It's an annoying feature of the conditional setcc instructions that they only set the low byte and leave the rest of the register alone, forcing you to zero it yourself in the common case that you want your boolean in the full register. And we have to get the input and output in two different registers, which is awkward because of x86's model where the output register is always one of the inputs.

Nate Eldredge
  • 48,811
  • 6
  • 54
  • 82
  • 1
    A more sane way to save 1 byte (avoiding `test r32, imm32` because there is no `test r32, imm8`) is `test $1, %dil`. But that still needs a REX prefix and immediate, unlike your shift. You're actually saving 5 bytes vs. the OP's original which uses `48 f7 c7 01 00 00 00 test rdi,0x1` instead of a simple `40 f6 c7 01 test dil,0x1` – Peter Cordes Sep 29 '20 at 04:32
  • 1
    I posted an answer with a 2-instruction way using LEA to copy-and-flip the low bit. – Peter Cordes Sep 29 '20 at 06:49