
I have the following C snippet:

#include <stdbool.h>

int main() {

    int tablica [100];
    bool visited [100];
    int counter;
    int i;

    for(i=0;i<=99;i++) {
        if (visited[i]==0) {
            counter=counter+1;
        }
    }

}

which I compiled to assembly. I received the following output:

   ; ...

    mov     eax, DWORD PTR [rbp-8]
    cdqe
    movzx   eax, BYTE PTR [rbp-528+rax]
    xor     eax, 1
    test    al, al
    je      .L3

    ; ...

Could anybody explain the meaning and purpose of the CDQE and MOVZX instructions in this code? I also don't understand what the XOR instruction is for.

Cody Gray - on strike
BartekS

  • [cdqe](https://www.felixcloutier.com/x86/cbw:cwde:cdqe) [movzx](https://www.felixcloutier.com/x86/movzx) – tkausl Feb 10 '19 at 16:50
  • It’s usually counterproductive to look at assembly code generated by a compiler with optimization disabled. – prl Feb 11 '19 at 01:41
  • You should understand how bitwise operations work before dealing with assembly. [tag:bitwise-xor], [tag:bit-manipulation] – phuclv Feb 11 '19 at 04:01

1 Answer

The CDQE instruction sign-extends a DWORD (32-bit value) in the EAX register to a QWORD (64-bit value) in the RAX register.

The MOVZX instruction zero-extends the source to the destination. In this case, it zero-extends the BYTE loaded from memory at [rbp-528+rax] to the DWORD destination register, EAX.

The XOR eax, 1 instruction just flips the lowest bit of EAX. If it is currently set (1), then it becomes clear (0). If it is currently clear (0), then it becomes set (1).

What is the big picture? Well, it turns out that this is almost completely pointless code, the kind of output you get from a compiler with optimizations disabled. There is very little purpose in trying to analyze it.

But, if you want, we can analyze it anyway. Here is the entire assembly output for your C code, as generated by GCC 8.2 at -O0, with each instruction annotated:

main():
        push    rbp                         ; \ standard function
        mov     rbp, rsp                    ; /  prologue code
        sub     rsp, 408                    ; allocate space for stack array
        mov     DWORD PTR [rbp-8], 0        ; i = 0
.L4:
        cmp     DWORD PTR [rbp-8], 99       ; is i <= 99?
        jg      .L2                         ; jump to L2 if i > 99; otherwise fall through
        mov     eax, DWORD PTR [rbp-8]      ; EAX = i
        cdqe                                ; RAX = i
        movzx   eax, BYTE PTR [rbp-528+rax] ; EAX = visited[i]
        xor     eax, 1                      ; flip low-order bit of EAX (EAX ^= 1)
        test    al, al                      ; test if low-order bit is set?
        je      .L3                         ; jump to L3 if low-order bit is clear (== 0)
                                            ;  (which means it was originally set (== 1),
                                            ;   which means visited[i] != 0)
                                            ; otherwise (visited[i] == 0), fall through
        add     DWORD PTR [rbp-4], 1        ; counter += 1
.L3:
        add     DWORD PTR [rbp-8], 1        ; i += 1
        jmp     .L4                         ; unconditionally jump to top of loop (L4)
.L2:
        mov     eax, 0                      ; EAX = 0 (EAX is result of main function)
        leave                               ; function epilogue
        ret                                 ; return

Neither an assembly programmer nor an optimizing compiler would produce this code. It makes very inefficient use of the registers, preferring to load and store to memory, including values like i and counter, which are prime candidates for keeping in registers, and it contains a number of pointless instructions.

Of course, an optimizing compiler would really do a number on this code, eliding it entirely since it has no observable side effects. The output would just be:

main():
        xor     eax, eax    ; main will return 0
        ret

That's not so interesting to analyze, but far more efficient. That's why we pay our C compilers the big bucks.

The C code also has undefined behavior in these lines:

int counter;
/* ... */
counter=counter+1;

You never initialize counter, but then you try to read from it. Since it is a variable with automatic storage duration, its contents are not automatically initialized, and reading from an uninitialized variable is undefined behavior. This justifies a C compiler emitting any assembly code that it wants.

Let's assume that counter is initialized to 0, and we were to write this assembly code by hand, ignoring the possibility of eliding the whole mess. We'd get something like:

main():
        mov     edx, OFFSET visited             ; EDX = &visited[0]
        xor     eax, eax                        ; EAX = 0
MainLoop:
        cmp     BYTE PTR [rdx], 1               ; \ EAX += (*RDX == 0) ? 1
        adc     eax, 0                          ; /                    : 0
        inc     rdx                             ; RDX += 1
        cmp     rdx, OFFSET visited + 100       ; is RDX == &visited[100]?
        jne     MainLoop                        ; if not, keep looping; otherwise, done
        ret                                     ; return, with result in EAX

What happened? Well, the calling convention says that EAX holds the return value, so I've put counter in EAX and assumed that we are returning counter from the function. RDX is a pointer tracking the current position in the visited array; it gets incremented by 1 (the size of a BYTE) on each iteration of MainLoop. With that in mind, the rest of the code should be straightforward, except for the ADC instruction.

This is the add-with-carry instruction, used here to implement the conditional inside the loop branchlessly. ADC performs the following operation:

destination = (destination + source + CF)

where CF is the carry flag. The CMP instruction right before it compares the byte with 1, which sets the carry flag exactly when the byte is 0 (computing 0 - 1 requires a borrow), and the source operand is 0, so the ADC does just what I commented to the right of the instruction: it adds 1 to EAX (counter) if *RDX == 0 (visited[i] == 0); otherwise, it adds 0, which is a no-op.

If you wanted to write branchy code, you'd do:

main():
        mov     edx, OFFSET visited             ; EDX = &visited[0]
        xor     eax, eax                        ; EAX = 0
MainLoop:
        cmp     BYTE PTR [rdx], 0               ; (*RDX == 0)?
        jne     Skip                            ; if not, branch to Skip; if so, fall through
        inc     eax                             ; EAX += 1
Skip:
        inc     rdx                             ; RDX += 1
        cmp     rdx, OFFSET visited + 100       ; is RDX == &visited[100]?
        jne     MainLoop                        ; if not, keep looping; otherwise, done
        ret                                     ; return, with result in EAX

This works just as well but, depending on how predictable the values in the visited array are, it may be slower because of branch mispredictions.

Peter Cordes
Cody Gray - on strike
  • If you actually wanted to help compilers optimize this, you'd do unconditional `notcounted += visited[i];` inside the loop, and `counter = 100-notcounted;` outside the loop. That should hopefully auto-vectorize nicely, although I wouldn't be surprised if compilers will be stupid and unpack bytes to dwords before adding even though the loop bound means there's no chance for overflow. And also fail to use `psadbw` for a horizontal sum of bytes. – Peter Cordes Feb 11 '19 at 03:23
  • Also, a function like `int count(bool *visited, size_t length)` could avoid optimizing away, and would be more interesting to look at. Maybe that's how you could introduce the hand-written asm section. https://godbolt.org/z/-nuUVA : We get an interesting selection of asm from the 4 major x86 compilers. clang and ICC auto-vectorize, both similar to the way I pessimistically expected, widening to dword early. gcc uses `adc eax,0` like you did. – Peter Cordes Feb 11 '19 at 03:28
  • clang also uses `pshufb` in there somewhere. Looks bad. With `-fno-unroll-loops` https://godbolt.org/z/Xc4zNR we can see it's using PXOR to invert the boolean before summing. But the pmovzxbd + pshufb on that to grab the low byte of each dword, and feed that to another pmovzxbd just looks insane. Oh, I think it's unpacking to dword to flip the boolean, then repacking, then unpacking to add. /facepalm – Peter Cordes Feb 11 '19 at 03:38
  • clang bugs reported: [40685](https://bugs.llvm.org/show_bug.cgi?id=40685) about the 3x shuffles. [40686](https://bugs.llvm.org/show_bug.cgi?id=40686) about the other missed optimizations: with AVX2 this can pretty easily run at ~2x 32-byte loads per clock, unrolling by 4 or so for large lengths, using what I suggested in my first comment. Or for exactly 100 elements, 3x 32 + 4 bytes is easy to fully unroll. – Peter Cordes Feb 11 '19 at 07:38