3

I am looking at the following piece of x86 assembly code (Intel syntax):

movzx   eax, al
and     eax, 3
cmp     eax, 3
ja      loc_6BE9A0

In my understanding, this should equal something like this in C:

eax &= 0xFF;
eax &= 3;
if (eax > 3)
   goto loc_6BE9A0;

This does not seem to make much sense since this condition will never be true (because eax will never be greater than 3 if it got and-ed with 3 before). Am I missing something here or is this really just an unnecessary condition?

And also: the movzx eax, al should not be necessary either if eax gets and-ed with 3 right after that, should it?

I am asking this because I am not so familiar with assembly language and so I am not entirely sure if I am missing something here.

Kebberling

2 Answers

5

You're correct: the movzx is redundant given the following and. It might have been produced by a non-optimizing compiler.
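To see why the movzx adds nothing here, the two instruction sequences can be modeled in C. This is only a sketch: the function names are made up for illustration, and a uint32_t stands in for EAX.

```c
#include <assert.h>
#include <stdint.h>

/* Models `movzx eax, al` followed by `and eax, 3`. */
uint32_t with_movzx(uint32_t eax)
{
    eax = (uint8_t)eax;   /* movzx eax, al: keep AL, clear bits 8..31 */
    eax &= 3;             /* and eax, 3 */
    return eax;
}

/* Models `and eax, 3` on its own. */
uint32_t without_movzx(uint32_t eax)
{
    return eax & 3;
}

/* Checks every AL value with a few different patterns in the upper bits. */
void check_movzx_is_redundant(void)
{
    static const uint32_t upper[] = {
        0x00000000u, 0x0000FF00u, 0xDEADBE00u, 0xFFFFFF00u
    };
    for (unsigned i = 0; i < sizeof upper / sizeof upper[0]; i++)
        for (uint32_t al = 0; al < 256; al++)
            assert(with_movzx(upper[i] | al) == without_movzx(upper[i] | al));
}
```

Both versions keep only bits 0 and 1, so whatever movzx cleared in bits 8..31 would have been cleared by the and anyway.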

And yes, if this code executes straight through then the ja jump will never be taken. However, the cmp/ja may not be completely useless if there is code somewhere else that jumps straight to the cmp (or even to the ja).
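As a rough illustration of that second case, here is a C sketch in which a hypothetical second path enters at the cmp with an unmasked value. The `enter_at_cmp` flag and the function names are assumptions made for this example; nothing in the question's disassembly shows that such a path actually exists.

```c
#include <assert.h>
#include <stdint.h>

/*
 * Returns 1 if the `ja` would be taken, 0 otherwise.
 * `enter_at_cmp` is a hypothetical stand-in for some other block of code
 * that jumps straight to the cmp, skipping the movzx/and.
 */
int ja_taken(uint32_t eax, int enter_at_cmp)
{
    if (enter_at_cmp)
        goto do_cmp;          /* other code jumping directly to the cmp */

    eax = (uint8_t)eax;       /* movzx eax, al */
    eax &= 3;                 /* and   eax, 3  */

do_cmp:
    if (eax > 3)              /* cmp eax, 3 / ja (unsigned compare) */
        return 1;
    return 0;
}

void check_ja_paths(void)
{
    /* Straight-line path: the mask guarantees eax <= 3, so never taken. */
    assert(ja_taken(0xFFFFFFFFu, 0) == 0);
    /* Hypothetical side entry with an unmasked value: the branch is live. */
    assert(ja_taken(7u, 1) == 1);
}
```

Only the side entry can make the comparison true, which is why the cmp/ja is dead on the fall-through path but not necessarily dead overall.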

Nate Eldredge
  • It's *possible* the movzx is there to avoid partial-register stalls on Intel P6-family, if earlier code only wrote AL and something actually needs to read the full register later (so this code couldn't have just used `and al, 3`). But given the rest of the code-gen, it's probably just a silly compiler, possibly MSVC in debug mode. (GCC -O0 doesn't optimize between statements, but it's not usually this dumb within a single statement. Neither is clang. I'm assuming a single statement; otherwise the var would spill/reload to the stack in a debug build.) – Peter Cordes Apr 18 '21 at 23:32
  • I was wrong, `gcc -O0` *will* emit that sequence when `al` comes from a function return value of `unsigned char`, but not when derefing an `unsigned char*`. – Peter Cordes Apr 19 '21 at 02:16
2

This is redundant, and not something you'd see in optimized asm.

Even if the cmp/ja were a possible jump target from somewhere else, existing optimizing compilers like GCC, clang, MSVC, and ICC would (I'm pretty sure) use a jmp or a different code layout instead of letting execution fall into a conditional branch that would never be taken. The optimizer would know there doesn't need to be a conditional branch along this path of execution, so it would make sure execution didn't encounter one. (Even if that cost an additional jmp.)

That's probably a good choice, even in the hypothetical case where some code-size saving was possible this way, because you don't want to pollute / dilute branch-prediction history with unnecessary conditional branches, and the branch could mispredict as taken.


But in debug mode, some compilers are better than others at switching off their brains for optimizations within a single statement or expression. (Across statements they always spill/reload vars to memory, unless you use register int foo;)

I was able to trick clang -O0 and MSVC into emitting that exact sequence of instructions, and GCC into emitting something similar but worse. (That's surprising, because gcc -O0 still does some optimizations inside a single expression, like using a multiplicative inverse for x /= 10; and dead-code removal for if(false), whereas MSVC actually puts a 0 in a register and tests that it's 0.)

void dummy();
unsigned char set_al();
int foo(void) {
    if ((set_al() & 3) <= 3U)
        dummy();
    return 0;
}

clang 12.0 -O0 for x86-64 Linux (on Godbolt):

        push    rbp
        mov     rbp, rsp
        call    set_al()
        movzx   eax, al            # The redundant sequence
        and     eax, 3
        cmp     eax, 3
        ja      .LBB0_2
        call    dummy()
.LBB0_2:
        xor     eax, eax
        pop     rbp
        ret

MSVC's output contained the same sequence. GCC 10.3's was similar but worse, materializing a boolean in a register and testing it. (Both are also in the same Godbolt link.)

## GCC10.3
 ... set up RBP as a frame pointer
        movzx   eax, al            # The redundant sequence
        and     eax, 3
        cmp     eax, 3
        setbe   al
        test    al, al             # even worse than just jnbe
        je      .L2
        call    dummy()
.L2:
        mov     eax, 0
        pop     rbp
        ret

With the char coming from memory instead of a return value, GCC does optimize away the compare even in debug mode:

int bar(unsigned char *p) {
    if ((*p & 3) <= 3U)
        dummy();
    return 0;
}

# GCC 10.3 -O0
bar(unsigned char*):
        push    rbp
        mov     rbp, rsp
        sub     rsp, 16                # space to spill the function arg
        mov     QWORD PTR [rbp-8], rdi
        call    dummy()                # unconditional call
        mov     eax, 0
        leave
        ret

clang and MSVC do the test, both with asm like:

#MSVC19.28 (VS16.9)  default options (debug mode)
     ...
        movzx   eax, BYTE PTR [rax]
        and     eax, 3
        cmp     eax, 3
        ja      SHORT $LN2@bar
     ...
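The reason GCC can drop the compare in bar() is that the condition is true for every possible byte value. A minimal C sketch that checks this exhaustively follows; the helper names are invented for illustration and are not part of the original test case.

```c
#include <assert.h>

/* The condition from foo() and bar(), as a function of the byte value. */
int condition(unsigned char c)
{
    /* c is promoted to int, (c & 3) lies in 0..3, and the comparison
       against 3U is done as unsigned. */
    return (c & 3) <= 3U;
}

/* Tries all 256 possible values of an unsigned char. */
void check_condition_always_true(void)
{
    for (int v = 0; v <= 255; v++)
        assert(condition((unsigned char)v) == 1);
}
```

Since no input can make the condition false, calling dummy() unconditionally, as GCC does for bar(), preserves the program's behavior.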
Peter Cordes