What's the purpose of looping "xorl %edx,%eax; shrl $1,%edx"?

Question

I have the following x86 assembly code:

  movl   8(%ebp), %edx  //get an argument from the caller
  movl   $0, %eax
  testl  %edx, %edx
  je     .L1            
.L2:                   // what's the purpose of this loop body?
  xorl   %edx, %eax
  shrl   $1, %edx
  jne    .L2
.L1:
  andl   $1, %eax

The corresponding C code that the textbook gives as follows

int f1(unsigned x)
{
    int y = 0;
    while(x != 0) {
        __________;
    }
    return __________;
 }

The book asks readers to fill the blank and answer the question of "What does it do?"

I can't combine the loop body in one C expression. I can tell what the loop body does, but I have no idea about its purpose. The textbook also says that %eax here stores the return value. So...what's the purpose of

andl  $1, %eax

I also have no idea.

`xorl` is an xor (`^`) operation, while `shrl` is a right shift (`>>`)... — Chris Dodd, Aug 11 '16 at 03:04
@ChrisDodd: I think the OP knows that, but didn't put the pieces together of looking at just the low bit of the register that the loop updates. — Peter Cordes, Aug 11 '16 at 03:34
@PeterCordes If the original poster knew that he would be able determine the equivalent C expression. You don't actually need to know the purpose the code to fill in the blanks. — Ross Ridge, Aug 11 '16 at 03:42
@RossRidge: I think they were stuck on trying to do it with a single C expression, rather than `y ^= x; x>>=1;` on one line. Or am I also missing some single-expression way to express that in C (i.e. not the comma operator or separate statements.) Of course, the reason I upvoted is for trying to grok what the function *does*, not just how to solve the fill-in-the-blank homework decompile question. — Peter Cordes, Aug 11 '16 at 03:46

score 6 · Accepted Answer · edited May 23 '17 at 12:22

It looks like the purpose of the whole loop is to XOR all the bits together in the 32-bit arg. i.e. calculate the parity.

Working backwards from the last instruction (and $1,%eax), we know that only the low bit of the result matters.

With that in mind, the xor %edx,%eax becomes clearer: xor the current low bit of %edx into %eax. The high garbage doesn't matter.

The shr loops until all of x's bits have been shifted out. We could always loop 32 times to get all the bits, but that would be less efficient than stopping once x is 0. (Because of how XOR works, we don't need to actual XOR in the 0 bits; that has no effect.)

Once we know what the function does, filling in the C becomes an exercise in clever / compact C syntax. I thought at first that y ^= (x>>=1); would fit inside the loop, but that shifts x before using it the first time.

The only way I see to do it in one C statement is with the , operator (which does introduce a sequence point, so it's safe to read x on the left side and modify it on the right side of a ,). So, y ^= x, x>>=1; fits.

Or, for more readable code, just cheat and put two statements on the same line with a ;.

int f1(unsigned x) {
    int y = 0;
    while(x != 0) {
        y ^= x;  x>>=1;      
    }
    return y & 1;
 }

This compiles to essentially the same asm as shown in the question, using gcc5.3 -O3 on the Godbolt compiler explorer. The question's code de-optimizes the xor-zeroing idiom to a mov $0, %eax, and optimizes gcc's silly duplication of ret instructions. (Or maybe used an earlier version of gcc that didn't do that.)

The loop is very inefficient: this is an efficient way:

We don't need a loop with O(n) complexity (where n is the width in bits of x). Instead, we can get O(log2(n)) complexity, and actually take advantage of x86 tricks to only do the first 2 steps of that.

I've left off the operand-size suffix for instructions where it's determined by the registers. (Except for xorw to make the 16-bit xor explicit.)

#untested
parity:
    # no frame-pointer boilerplate

    xor       %eax,%eax        # zero eax (so the upper 24 bits of the int return value are zeroed).  And yes, this is more efficient than mov $0, %eax
                               # so when we set %al later, the whole of %eax will be good.

    movzwl    4(%esp), %edx      # load low 16 bits of `x`.  (zero-extend into the full %edx is for efficiency.  movw 4(%esp), %dx would work too.
    xorw      6(%esp), %dx       # xor the high 16 bits of `x`
    # Two loads instead of a load + copy + shift is probably a win, because cache is fast.
    xor       %dh, %dl           # xor the two 8 bit halves, setting PF according to the result
    setnp      %al               # get the inverse of the CPU's parity flag.  Remember that the rest of %eax is already zero, so the result is already zero-extended to 32-bits (int return value)
    ret

Yes, that's right, x86 has a parity flag (PF) that's updated from the low 8 bits of the result of every instruction that "sets flags according to the result", like xor.

We use the np condition because PF = 1 means even parity: xor of all bits = 0. We need the inverse to return 0 for even parity.

To take advantage of it, we do a SIMD-style horizontal reduction by bringing the high half down to the low half and combining, repeating twice to reduce 32 bits to 8 bits.

Zeroing eax (with an xor) before the instruction that sets flags is slightly more efficient than doing set-flags / setp %al / movzbl %al, %eax, as I explained in What is the best way to set a register to zero in x86 assembly: xor, mov or and?.

Or, as @EOF points out, if the CPUID POPCNT feature bit is set, you can use popcnt and test the low bit to see if the number of set bits is even or odd. (Another way to look at this: xor is add-without-carry, so the low bit is the same whether you xor all the bits together or add all the bits together horizontally).

GNU C also has __builtin_parity and __builtin_popcnt which use the hardware instruction if you tell the compiler that the compile target supports it (with -march=... or -mpopcnt), but otherwise compile to an efficient sequence for the target machine. The Intel intrinsics always compile to the machine instruction, not a fallback sequence, and it's a compile-time error to use them without the appropriate -mpopcnt target option.

Unfortunately gcc doesn't recognize the pure-C loop as being a parity calculation and optimize it into this. Some compilers (like clang and probably gcc) can recognize some kinds of popcount idioms, and optimize them into the popcnt instruction, but that kind of pattern recognition doesn't happen in this case. :(

See these on godbolt.

int parity_gnuc(unsigned x) {
    return  __builtin_parity(x);
}
    # with -mpopcnt, compiles the same as below
    # without popcnt, compiles to the same upper/lower half XOR algorithm I used, and a setnp
    # using one load and mov/shift for the 32->16 step, and still %dh, %dl for the 16->8 step.

#ifdef __POPCNT__
#include <immintrin.h>
int parity_popcnt(unsigned x) {
    return  _mm_popcnt_u32(x) & 1;
}
#endif

    # gcc does compile this to the optimal code:
    popcnt    4(%esp), %eax
    and       $1, %eax
    ret

See also other links in the x86 tag wiki.

If you would be able to modify the code more, it would be possible to do it in simpler (1 less "statement") C++ source `int y = x; while (x>>=1) { y ^= x; } return y&1;` (while staying true to the original stupid way of calculating parity by xoring each bit) ... but curiously according to godbolt it does procude a tiny bit worse machine code (one more `mov eax,edi` to your version). Don't want to discuss it here, but why I always expect more from compilers? :/ Like why it does not align the loop by moving the function start, but it puts NOP inside the fn body instead? I don't get it. :D — Ped7g, Aug 11 '16 at 09:44
But the optimized method returns the opposite value? The PF is set when bits are even, while the original function returns 0 for even bits, if I didn't do some mistake. BTW, I didn't know the PF was affected only by low 8 bits till today... (after writing well over 1+MB of x86 ASM sources over decade) ... The learning process never stops.. :D — Ped7g, Aug 11 '16 at 09:57
Well, since you mention the assembly equivalent, I think it's also worth mentioning compiler-specific extensions such as `__builtin_parity` and/or `__builtin_popcount` at least for GCC et al. Good answer btw. — edmz, Aug 11 '16 at 15:06
@Ped7g: thanks for the bugfix on the optimized version. A good efficient algorithm was much more interesting than annoying complications like looking up the sense of PF to avoid inverting the answer :P re: alignment: I don't think gcc even knows instruction sizes, and just has some tuneable heuristics for emitting `.p2align`. After reporting a few missed-optimization bugs, I've realized that compilers aren't "smart", they're just very complicated but still dumb machines. I agree it's disappointing that compilers aren't smart, but fast compile times preclude more than polynomial complexity. — Peter Cordes, Aug 11 '16 at 15:15
@black: good point, the GNU C __builtin functions still work without enabling `-mpopcnt`. — Peter Cordes, Aug 11 '16 at 15:39
@PeterCordes: Yes, the fact it doesn't need SSE support makes it more portable across platforms. OTOH, if you need it more portable across _compilers_, SSE is better given the common-agreed interface, as you say (assuming hardware support, clearly). — edmz, Aug 11 '16 at 15:50
@black: Good point. `popcnt` is separate from SSE (see the link about the CPUID feature bit), but yes, it's x86-only. However, if you compile without `-mpopcnt`, you won't take advantage of the `popcnt` instruction even on hardware that does support it. This is where a JIT-compiled language like Java has an advantage, since dynamic dispatching for tiny functions isn't viable (you could maybe dynamic dispatch larger functions that inline the parity function). — Peter Cordes, Aug 11 '16 at 16:00

What's the purpose of looping "xorl %edx,%eax; shrl $1,%edx"?

1 Answers1

The loop is very inefficient: this is an efficient way:

Linked

Related