
For example, with this function,

void mask_rol(unsigned char *a, unsigned char *b) {
    a[0] &= __rolb(-2, b[0]);
    a[1] &= __rolb(-2, b[1]);
    a[2] &= __rolb(-2, b[2]);
    a[3] &= __rolb(-2, b[3]);
    a[4] &= __rolb(-2, b[4]);
    a[5] &= __rolb(-2, b[5]);
    a[6] &= __rolb(-2, b[6]);
    a[7] &= __rolb(-2, b[7]);
}

gcc produces,

mov     edx, -2
mov     rax, rdi

movzx   ecx, BYTE PTR [rsi]
mov     edi, edx
rol     dil, cl
and     BYTE PTR [rax], dil
...

While I don't understand why it fills edx and rax, this is clang's output.

mov     cl, byte ptr [rsi]
mov     al, -2
rol     al, cl
and     byte ptr [rdi], al
...

It doesn't emit the seemingly unnecessary movs that gcc does, but it also doesn't clear the upper bits with movzx.

As far as I know, gcc uses movzx to remove the false dependency on the dirty upper bits. Maybe clang also has a reason not to do it, so I ran a simple benchmark; this is the result.

$ time ./rol_gcc
 2161860550

real    0m0.895s
user    0m0.877s
sys     0m0.002s

$ time ./rol_clang
 3205979094

real    0m1.328s
user    0m1.311s
sys     0m0.001s
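Dividing those totals by the 300000000 loop iterations gives a rough per-call cost (a quick sketch; rdtscp counts reference ticks, not core clock cycles, so this is only a relative comparison):

```c
#include <stdint.h>

/* Rough per-call cost from the totals above. rdtscp() counts
 * reference ticks, not core clock cycles, so treat this only as a
 * relative comparison between the two builds. */
static double ticks_per_call(uint64_t total_ticks) {
    return (double)total_ticks / 300000000.0;  /* loop iteration count */
}
/* ticks_per_call(2161860550) ~= 7.2  ticks/call (gcc build)
 * ticks_per_call(3205979094) ~= 10.7 ticks/call (clang build) */
```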

At least in this case, clang's approach seems to be wrong.

Is this clearly clang's bug, or are there some cases in which clang's approach could produce more efficient code?


benchmark code

#include <stdio.h>
#include <x86intrin.h>

__attribute__((noinline))
static void mask_rol(unsigned char *a, unsigned char *b) {
    a[0] &= __rolb(-2, b[0]);
    a[1] &= __rolb(-2, b[1]);
    a[2] &= __rolb(-2, b[2]);
    a[3] &= __rolb(-2, b[3]);
    a[4] &= __rolb(-2, b[4]);
    a[5] &= __rolb(-2, b[5]);
    a[6] &= __rolb(-2, b[6]);
    a[7] &= __rolb(-2, b[7]);
}

static unsigned long long rdtscp() {
    unsigned _;
    return __rdtscp(&_);
}

int main() {
    unsigned char a[8] = {0}, b[8] = {7, 0, 6, 1, 5, 2, 4, 3};
    unsigned long long c = rdtscp();
    for (int i = 0; i < 300000000; ++i) {
        mask_rol(a, b);
    }
    printf("%11llu\n", rdtscp() - c);
    return 0;
}
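For readers without an x86 machine, __rolb behaves like this portable model (a sketch of the intrinsic's semantics, not its actual implementation; rolb_model is a made-up name):

```c
#include <stdint.h>

/* Portable model of __rolb: rotate an 8-bit value left by n (mod 8).
 * The (8 - n) & 7 trick avoids the undefined shift-by-8 when n == 0. */
static uint8_t rolb_model(uint8_t x, unsigned n) {
    n &= 7;
    return (uint8_t)((uint8_t)(x << n) | (x >> ((8u - n) & 7u)));
}
```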
xiver77
  • A web search on `__rolb` shows a different implementation in `ia32intrin.h` for clang and gcc. Each compiler provides its own version of the file – Craig Estey Jan 25 '22 at 22:48
  • Try: `x = (unsigned char) __rolb(y);` – Craig Estey Jan 25 '22 at 22:52
  • 1
    clang/LLVM is reckless in general about false dependencies. It tries to avoid creating loop-carried dep chains in a loop *inside a single function*, I think (which you've defeated by making this small frequently-called function `noinline`). But for some reason they choose to save a byte of code size instead of avoiding false dependencies for integer regs in other cases. (Not worth it IMO, unlike sometimes avoiding a whole instruction for xor-zeroing integer or vector regs.) – Peter Cordes Jan 26 '22 at 01:53
  • Near duplicate of [Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?](https://stackoverflow.com/q/60688348) - a non-inline function call creates a loop-carried dep chain due to clang's cavalier attitude towards false dependencies. (In that case on XMM registers, rather than scalar int where P6-family partial register renaming would actually break the false dep. But not on Haswell and later: [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502)). – Peter Cordes Jan 26 '22 at 02:01
  • So yeah, it's a clang missed-optimization bug, or a case where its heuristics didn't pay off. – Peter Cordes Jan 26 '22 at 02:02

1 Answer


clang/LLVM is reckless in general about false dependencies. It tries to avoid creating loop-carried dep chains in a loop inside a single function, I think, but you've defeated that by making this small, frequently-called function noinline.

Avoiding a whole instruction for xor-zeroing integer or vector regs might be worth the risk sometimes, but saving 1 byte of code for mov al over movzx eax seems a lot less worth the risk. All x86 CPUs have had efficient movzx loads for many years now.
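The difference between the two instructions can be modeled in C (a sketch of the architectural semantics, not of any compiler output; mov_al and movzx_eax are made-up names):

```c
#include <stdint.h>

/* "mov al, byte ptr [mem]": merges the loaded byte into the old 64-bit
 * register value, so the result depends on whatever last wrote RAX --
 * the false dependency. */
static uint64_t mov_al(uint64_t rax_old, uint8_t mem) {
    return (rax_old & ~0xFFull) | mem;
}

/* "movzx eax, byte ptr [mem]": writes the full register (and writing
 * EAX zeroes bits 63:32), so there is no input dependency on old RAX. */
static uint64_t movzx_eax(uint64_t rax_old, uint8_t mem) {
    (void)rax_old;  /* unused: movzx breaks the dependency */
    return mem;
}
```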

Near duplicate of [Why does adding an xorps instruction make this function using cvtsi2ss and addss ~5x faster?](https://stackoverflow.com/q/60688348) - a non-inline function call creates a loop-carried dep chain due to clang's cavalier attitude towards false dependencies. In that case it was XMM registers, rather than scalar integer, where P6-family (and Sandybridge) partial-register renaming would actually break the false dep. But not Haswell and later, which don't rename low-8 registers separately from the full register: [Why doesn't GCC use partial registers?](https://stackoverflow.com/q/41573502)


So yeah, it's a clang missed-optimization bug, or a case where its heuristics didn't pay off. I'm curious how much difference it would make (positive or negative) for clang to always use movzx for narrow loads, in code that doesn't need it to avoid loop-carried false dependencies.

Clang should probably change this if any downside is tiny across the board on different CPU types, or at least balanced by the big upside of avoiding slowdowns like this. (There's also a throughput upside: fewer back-end uops taking space in the RS, because movzx is a pure load, while modern Intel decodes mov al, mem as a micro-fused load+ALU merge.)

Or if for some reason an always-movzx strategy isn't better in general, it should still use one somewhere in this long non-looping dep chain, like at least one in the middle for each of AL and CL, to create more ILP even if the function only runs once. And/or alternate AL and DL or something. (clang 13 surprisingly uses DL for the last byte, but AL for the previous 7: https://godbolt.org/z/7PYWGxsse - in future questions it would be a good idea to include your own compiler explorer link with versions / options matching what you tested.)


While I don't understand why it fills edx and rax

It looks like GCC is reusing the same -2 constant, using mov edi, edx (2 bytes) eight times instead of mov edi, -2 (5 bytes) eight times. But maybe code-size wasn't the reason, because GCC will normally spend code-size to save instructions. IDK.

Also, GCC's register allocation is sometimes sub-optimal around hard-register constraints like function args and return values. So yeah, it's just wasting instructions copying the incoming pointer to RAX. The function doesn't return it. And dil is a dumb choice for a register to rotate; it needs a REX prefix when al or dl wouldn't.

Peter Cordes