
Consider this C code:

int f(void) {
    int ret;
    char carry;

    __asm__(
        "nop # do something that sets eax and CF"
        : "=a"(ret), "=@ccc"(carry)
    );

    return carry ? -ret : ret;
}

When I compile it with gcc -O3, I get this:

f:
        nop # do something that sets eax and CF
        setc    %cl
        movl    %eax, %edx
        negl    %edx
        testb   %cl, %cl
        cmovne  %edx, %eax
        ret

If I change char carry to int carry, I instead get this:

f:
        nop # do something that sets eax and CF
        setc    %cl
        movl    %eax, %edx
        movzbl  %cl, %ecx
        negl    %edx
        testl   %ecx, %ecx
        cmovne  %edx, %eax
        ret

That change replaced testb %cl, %cl with movzbl %cl, %ecx and testl %ecx, %ecx. The program is actually equivalent, though, and GCC knows it. As evidence of this, if I compile with -Os instead of -O3, then both char carry and int carry result in the exact same assembly:

f:
        nop # do something that sets eax and CF
        jnc     .L1
        negl    %eax
.L1:
        ret

It seems like one of two things must be true, but I'm not sure which:

  1. A testb is faster than a movzbl followed by a testl, so GCC's use of the latter with int is a missed optimization.
  2. A testb is slower than a movzbl followed by a testl, so GCC's use of the former with char is a missed optimization.

My gut tells me that an extra instruction will be slower, but I also have a nagging doubt that it's preventing a partial register stall that I just don't see.

By the way, the usual recommended approach of xoring the register to zero before the setc doesn't work in my real example. You can't do it after the inline assembly runs, since xor will overwrite the carry flag, and you can't do it before the inline assembly runs, since in the real context of this code, every general-purpose call-clobbered register is already in use somehow.

  • There is also an alternate sequence for conditional negation based on CF (shorter than this); do you consider that on topic? – harold Jun 19 '20 at 21:02
  • @harold [This one](https://stackoverflow.com/questions/62477260/is-movzbl-followed-by-testl-faster-than-testb?noredirect=1#comment110493646_62477430)? – Joseph Sible-Reinstate Monica Jun 19 '20 at 21:03
  • That's a nice one as well, but what I was thinking of was `sbb / xor / sub` – harold Jun 19 '20 at 21:04
  • @harold Figuring out which operands went where to make that work was a nice brain exercise (looks like `sbb %ecx, %ecx`; `xor %ecx, %eax`; `sub %ecx, %eax`). Anyway, yes, I'm definitely looking for those kind of clever ideas. I wonder if processors are smart enough to avoid a false dependency on `ecx` though. – Joseph Sible-Reinstate Monica Jun 19 '20 at 21:18
  • AMDs processors are (according to Agner Fog), Intels still read ecx as far as I know – harold Jun 19 '20 at 21:39

1 Answer


There's no downside I'm aware of to reading a byte register with test vs. movzb.

If you are going to zero-extend, it's also a missed optimization not to xor-zero a reg ahead of the asm statement, and setc into that so the cost of zero-extension is off the critical path. (On CPUs other than Intel IvyBridge+ where movzx r32, r8 is not zero latency). Assuming there's a free register, of course. Recent GCC does sometimes find this zero/set-flags/setcc optimization for generating a 32-bit boolean from a flag-setting instruction, but often misses it when things get complex.

Fortunately for you, your real use-case couldn't do that optimization anyway (except with mov $0, %eax zeroing, which would be off the critical path for latency but cause a partial-register stall on Intel P6 family, and cost more code size.) But it's still a missed optimization for your test case.

Peter Cordes
  • Now I'm wondering whether `leal -1(%eax), %edx`; `notl %edx`; `cmovcl %edx, %eax` would be faster than messing around saving and re-testing flags. (And I'm also really starting to doubt "the compiler is better at optimizing than you are; let it do its job".) – Joseph Sible-Reinstate Monica Jun 19 '20 at 20:43
  • @JosephSible-ReinstateMonica: yeah, certainly worth considering. And yes, compilers are very good at a large scale, doing constant-propagation and inlining in seconds to create asm that would be unmaintainable by hand and/or take years to write. But over a small scale like a single loop or function, compilers are often not optimal. That advice applies to most people most of the time, but often not people who have read and understood Agner Fog's microarch PDF (https://agner.org/optimize/). Of course the downside to actually writing in asm is that what's good on future CPUs might be different – Peter Cordes Jun 19 '20 at 20:46
  • @JosephSible-ReinstateMonica: keep in mind that compile speed is still a factor so compilers usually need to avoid algorithms with O(e^N) or worse time complexity, unless N is small and fixed (like the number of registers in the machine). So if you want optimal code for current CPUs, you need to either hand-hold the compiler into doing better ([C++ code for testing the Collatz conjecture faster than hand-written assembly - why?](https://stackoverflow.com/q/40354978)), or just write asm by hand. Especially for stuff like extended-precision where you want an `adc` chain. – Peter Cordes Jun 19 '20 at 20:51