Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip-count can't be computed before the first iteration (in the abstract machine), so GCC and clang wouldn't even try to use pcmpeqb
/pmovmskb
for large sizes. (Which is the efficient way to memcmp
on x86-64 for large inputs.)
Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization.)
See also how to auto vectorization array comparison function - writing it as a loop that counts mismatches and always touches all elements can give auto-vectorization. (Or an OR reduction instead of a sum reduction). But that won't help for a small fixed size like 4 bytes. And removing the early-out entirely is a disaster for a 1GiB memcmp with a difference in the first byte. (A good SIMD memcmp like glibc's SSE2/AVX2 versions would branch maybe every 64 to 128 bytes on a vector compare results.)
Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy
; GCC and clang can turn simple copy loops into call memcpy
or call memset
, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy
or else your function turns into a call to itself... Pick a different name for experiments.)
Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a memcmp
idiom in GCC's internals. I think that would be necessary for good code with small problems, like this case where we want a single dword cmp
.
Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data
points to a 1-byte buffer holding 'z'
, your manual loop has well-defined behaviour in the abstract machine. Reading all 4 bytes would be past the end of the buffer.
But memcmp
is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for early-out or for pointer alignment. (Can std::memcmp read any bytes past the first difference? yes, unlike your loop.)
(In x86-64 asm, going past the end could go into an unmapped page, resulting in a segfault, violating the as-if rule. Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, but only within the same page. This is something you can work around with aligned loads for strlen
and strchr
, but a bigger obstacle for vectorizing strcmp
with differently-aligned pointers.)
Instead of comparing two unknown pointers from function args, I changed your test_data
caller to pass pointers to two global arrays, char foo[4], bar[4];
, which the compiler knows for certain are both readable. (Godbolt). But that didn't help, so still no.