
Given:

#include <stdbool.h>
#include <string.h>

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}

The compiler can optimize it into:

test_data:
    cmpl    $1684234849, (%rdi)
    sete    %al
    ret

which is nice.

But if I use my own memcmp() (not the one from <string.h>), the compiler can't optimize it into a single cmpl instruction. Instead, it emits a chain of byte compares:

#include <stdbool.h>
#include <stddef.h>

static int memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    size_t i;

    for (i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }

    return 0;
}

bool test_data(void *data)
{
    return memcmp(data, "abcd", 4) == 0;
}
test_data:
    cmpb    $97, (%rdi)
    jne     .L5
    cmpb    $98, 1(%rdi)
    jne     .L5
    cmpb    $99, 2(%rdi)
    jne     .L5
    cmpb    $100, 3(%rdi)
    sete    %al
    ret
.L5:
    xorl    %eax, %eax
    ret

Link: https://godbolt.org/z/Kfhchr45a

  • What prevents the compiler from further optimizing it?
  • Did I do something that inhibits the optimization?
chqrlie
Ammar Faizi
    I'm not sure that godbolt is telling you what you think it is. Try adding `int main() { return test_data("abcd"); }`. It optimizes main down to `movl $1, %eax`. – David Wohlferd Sep 01 '23 at 04:55
    @DavidWohlferd: Yes, GCC can do constant-folding through a loop when *both* inputs are constant. But that's not the question; the OP is looking at whether the compiler can optimize when the size is a known small constant. All we get is a fully-unrolled chain of compare/branch, which sucks horribly compared to a single dword `cmp`. – Peter Cordes Sep 01 '23 at 05:13
  • Not directly answering your question, but compilers usually have special intrinsics for functions like `memcmp` and `memcpy` and even if the intrinsics are not used, `memcmp` is often written in assembly in the source code (You see `memcmp.S` in the source code of glibc for x86, for example, not `memcmp.c`). – Weijun Zhou Sep 01 '23 at 05:14
    @WeijunZhou: Yes, true. And unless you use `-fno-builtin`, GCC treats `memcmp` as `__builtin_memcmp`; that's how it turns it into a single `cmp` instruction rather than a `call memcmp`. Note that inline expansions of `memcpy` / `memcmp` are totally separate from the hand-written asm implementations in `libc.so` and the glibc source code. GCC has its own patterns for expanding "string" functions. (It used to sometimes use `repne scasb` for `strlen`, at least at `-O1`, but stopped because [that's very slow for large buffers](https://stackoverflow.com/a/55563916).) – Peter Cordes Sep 01 '23 at 05:20
    Optimisers do not perform miracles; they match patterns in code that someone put in. The pattern of your handwritten memcmp is not in, that's all. – n. m. could be an AI Sep 01 '23 at 06:12

3 Answers


Data-dependent branching defeats auto-vectorization in GCC/Clang (but not classic ICC). The trip-count can't be computed before the first iteration (in the abstract machine), so GCC and clang wouldn't even try to use pcmpeqb/pmovmskb for large sizes. (Which is the efficient way to memcmp on x86-64 for large inputs.)

Apparently this is a hard problem, as it's been unsolved in GCC/Clang for 15 to 20 years (from now back to whenever GCC first started doing auto-vectorization.)

See also the question "how to auto vectorization array comparison function": writing the comparison as a loop that counts mismatches and always touches all elements can enable auto-vectorization (or use an OR reduction instead of a sum reduction). But that won't help for a small fixed size like 4 bytes. And removing the early-out entirely is a disaster for a 1 GiB memcmp with a difference in the first byte. (A good SIMD memcmp like glibc's SSE2/AVX2 versions would branch maybe every 64 to 128 bytes on a vector compare result.)


Apparently there's no idiom-recognition either, at least not for this way of writing it. (There is for memcpy; GCC and clang can turn simple copy loops into call memcpy or call memset, or inline expansions of those functions. So if you're writing your own, you need -fno-builtin-memcpy or else your function turns into a call to itself... Pick a different name for experiments.)
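For contrast, here's the kind of copy loop that does get idiom-recognized (a sketch of my own, with a made-up function name; whether it actually becomes a `call memcpy` depends on compiler and options):

```c
#include <stddef.h>

/* A naive byte-copy loop. GCC and Clang typically recognize this idiom
 * at -O2 and replace the loop with a call to memcpy (or an inline
 * expansion of it). Compile with -fno-builtin-memcpy to keep the loop. */
void copy_bytes(unsigned char *dst, const unsigned char *src, size_t n)
{
    for (size_t i = 0; i < n; i++)
        dst[i] = src[i];
}
```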

Writing it as a loop that unconditionally touches all bytes could give auto-vectorization, but probably won't get recognized as a memcmp idiom in GCC's internals. I think that would be necessary for good code with small problems, like this case where we want a single dword cmp.
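The branchless shape being described could look something like this (my sketch, not from the question; note it only answers equal/not-equal, not the ordered three-way result memcmp returns):

```c
#include <stddef.h>

/* Touches all n bytes unconditionally: an OR reduction over the XOR of
 * each byte pair. There is no data-dependent branch, so the trip count
 * is known before the first iteration and the loop is a candidate for
 * auto-vectorization. */
int memdiff(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    unsigned char acc = 0;
    for (size_t i = 0; i < n; i++)
        acc |= p1[i] ^ p2[i];
    return acc != 0;   /* nonzero iff the buffers differ somewhere */
}
```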


Compilers must avoid introducing segfaults by inventing reads past where the abstract machine stops. If void *data points to a 1-byte buffer holding 'z', your manual loop has well-defined behaviour in the abstract machine. Reading all 4 bytes would be past the end of the buffer.

But memcmp is UB if any part of the array is inaccessible, so the compiler can touch all 4 bytes without checking for early-out or for pointer alignment. (Can std::memcmp read any bytes past the first difference? yes, unlike your loop.)

(In x86-64 asm, going past the end could go into an unmapped page, resulting in a segfault, violating the as-if rule. Is it safe to read past the end of a buffer within the same page on x86 and x64? - yes, but only within the same page. This is something you can work around with aligned loads for strlen and strchr, but a bigger obstacle for vectorizing strcmp with differently-aligned pointers.)
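To make the abstract-machine point concrete, here is the question's loop (renamed so it doesn't shadow the builtin) applied to a 1-byte buffer; only the first iteration executes, so nothing past the buffer is ever read:

```c
#include <stddef.h>

/* Same byte-at-a-time loop as in the question, renamed. */
int memcmp_loop(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    for (size_t i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)            /* early-out: later bytes are never read */
            return ret;
    }
    return 0;
}
```

With `char z[1] = {'z'}`, `memcmp_loop(z, "abcd", 4)` returns at `i == 0` because `'z' != 'a'`, so `z[1..3]` are never accessed; a compiler that loaded all 4 bytes up front could fault.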

Instead of comparing two unknown pointers from function args, I changed your test_data caller to pass pointers to two global arrays, char foo[4], bar[4], which the compiler knows for certain are both readable (Godbolt). But that didn't help, so still no.
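That experiment looked roughly like this (my reconstruction, not the exact Godbolt code):

```c
#include <stdbool.h>
#include <stddef.h>

/* Same byte-at-a-time loop as in the question, renamed. */
static int byte_cmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    for (size_t i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
            return ret;
    }
    return 0;
}

/* Both arrays are global, so the compiler can prove that all
 * 4 bytes of each are readable. */
char foo[4], bar[4];

bool test_globals(void)
{
    return byte_cmp(foo, bar, 4) == 0;
}
```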

Peter Cordes
  • So ironically the code might run faster if we keep looping even after the mismatch has been found? `int ret = 0; for (i = 0; i < n; i++) { ret |= p1[i] - p2[i]; } return ret;` results in entirely different code - _maybe_ more efficient? Both compilers unrolled the loop and clang even offered up a `pcmpeqb`. https://godbolt.org/z/adcG8nPhf – Lundin Sep 01 '23 at 07:00
  • @Lundin: Yes, although it's arguably not ironic when you consider that the asm we really want also compares all 4 bytes. With SSE2, clang's strategy sucks a lot; it doesn't realize it's just a compare for equality so it actually adds negative numbers after unpacking bytes to 32-bit elements like the source did. This sucks less with a SSE4.1 `pmovzxbd` load: https://godbolt.org/z/xEc1zx7vK does `pmovzxbd mem, %xmm0` / `padd constant, %xmm0` / `ptest %xmm0,%xmm0` / `setz %al`. Branchlessness could be a big advantage if mismatches at different positions caused branch misses in the early-out. – Peter Cordes Sep 01 '23 at 07:37
  • @Lundin: With `short ret`, GCC also takes a wild swing at auto-vectorizing: https://godbolt.org/z/TsqjMP5oP (and clang still widens to 32-bit elements, even if that means it needs 2 vectors for 8 bytes). Unless we also change the return type to `short`, then clang uses `pmovzxbw`, and GCC does something with 4x `pshufb` as part of the OR reduction. Yikes. https://godbolt.org/z/f5c4dq3oY – Peter Cordes Sep 01 '23 at 07:41

As coded, the compiler cannot optimize the implementation for your special case because it cannot determine that reading the second byte pointed to by data is allowed if the first byte differs from 'a', and so on for the following bytes. memcmp is specified in the C Standard as having undefined behavior if any of the n bytes pointed to by either pointer is inaccessible. The compiler and/or standard library can therefore assume that all 4 bytes are accessible, and that unaligned access is supported on the target architecture.

If you modify your implementation to read all 4 bytes, assuming unaligned access is OK for the target architecture, both gcc and clang will optimize test_data, generating 3 instructions as expected:

#include <string.h>
#include <arpa/inet.h>

static int my_memcmp(const void *s1, const void *s2, size_t n)
{
    const unsigned char *p1 = s1, *p2 = s2;
    while (n >= sizeof(unsigned)) {
        unsigned u1, u2;
        memcpy(&u1, p1, sizeof u1);
        memcpy(&u2, p2, sizeof u2);
        if (u1 != u2)
            return htonl(u1) < htonl(u2) ? -1 : 1;
        n -= sizeof(unsigned);
        p1 += sizeof(unsigned);
        p2 += sizeof(unsigned);
    }
    for (size_t i = 0; i < n; i++) {
        int ret = p1[i] - p2[i];
        if (ret)
           return ret;
    }
    return 0;
}
chqrlie
    That has strict-aliasing UB unless the source data was `unsigned` or `int`. Also, unaligned access is never ok without telling the compiler about it, regardless of the target architecture. It usually happens to work on targets where it works in asm, but it's not safe. Use `memcpy` (aka `__builtin_memcpy`) to do 4-byte unaligned strict-aliasing-safe loads, or use GNU C `typedef uint32_t aliasing_u32 __attribute__((aligned(1), may_alias))`, as shown in [Why does unaligned access to mmap'ed memory sometimes segfault on AMD64?](https://stackoverflow.com/q/47510783) – Peter Cordes Sep 01 '23 at 06:25
  • See also [Why does glibc's strlen need to be so complicated to run quickly?](https://stackoverflow.com/a/57676035) for a use-case like this one (word-at-a-time string functions). – Peter Cordes Sep 01 '23 at 06:26
  • Strict aliasing hiccups might be avoided by doing a `memcpy(&u1, s1, 4)` though - and that won't affect performance much. – Lundin Sep 01 '23 at 07:03
    @Lundin: Indeed, that's what I meant in my first comment. In my experience, GCC and Clang fully see through such a `memcpy`, at least on targets with unaligned loads. On targets like MIPS where unaligned loads are special, they might actually call the library function to copy to a local on the stack, but might do something else for a pointer to an `__attribute__((aligned(1)))` typedef. Oh, maybe that's changed recently; MIPS64el GCC inlines those `memcpy`s as `lwl`/`lwr`: https://godbolt.org/z/bjPzGjGP4 (unless you compile for `-march=mips64r6` then it's just `lw`) – Peter Cordes Sep 01 '23 at 07:44
    @PeterCordes: I updated the code to remove the strict aliasing violations. The new code still compiles to the trivial 3 instructions on both compilers targeting x86_64, as long as optimizations are enabled `-O2`. the code generated for `-O0` is so inefficient :) – chqrlie Sep 01 '23 at 12:20
  • Yup, there you go. If you cared about `-O0` code-gen at all in GNU C, you could use a typedef for an `__attribute__((aligned(1),may_alias))` version of `unsigned` or `unsigned long`, loading from pointers of that type, but it would still of course copy to locals on the stack unless you declared them `register`. – Peter Cordes Sep 01 '23 at 18:46

In addition to what's already been said, a library implementation of memcmp for a 64-bit computer may carry out the comparison with a single wide load and compare, as long as it can deduce that the data is small enough.

So a naive implementation of an optimized memcmp for a 64-bit CPU might look like this:

#include <stdint.h>
#include <string.h>

static int memcmp1 (const void* s1, const void* s2, size_t n)
{
  if(n>8)
  {
    // call some other function suitable for larger chunks
    // e.g. the loop in chqrlie's answer, preferably using unsigned long
    // I used the real memcmp for illustration purposes here    
    return memcmp(s1,s2,n); 
  }

  uint64_t i1=0;
  memcpy(&i1, s1, n);  // clang turns this 4-byte copy into an 8-byte zero-extending load
  uint64_t i2=0;
  memcpy(&i2, s2, n);
  
  if(i1 < i2)   // assume i1 and i2 are big-endian, unlike on x86-64.
    return -1;  // Use be64toh if you care about < vs >, not just != vs. ==
  if(i1 > i2)
    return 1;
  return 0;
}

https://godbolt.org/z/edjn4hx8h

Despite being rather naive code, this still generates quite efficient assembly. The compiler ought to discard the call to the more intricate function when inlining memcmp1, because it knows the size of the data at compile time. gcc failed to do so in this case while clang succeeded, but either result looks pretty OK regardless.

As for the "function suitable for larger chunks", it will likely start by checking for alignment, treat misaligned bytes at the beginning and end of the data as special cases, and then perform the rest of the checks word by word, 8 bytes at a time on x86_64.

And as noted, implementing such library functions without abusing strict aliasing can be tricky: many functions in glibc and similar libraries contain blatant strict-aliasing violations, which is fine as long as the library isn't compiled as strictly conforming standard C. Compiled that way, such libraries often break spectacularly, so follow the library's build instructions if you build it yourself. See the section "Why this is safe as part of glibc but not otherwise" in this answer about glibc's strlen.

But the code in this answer is actually safe; memcpy is safe for aliasing and alignment.
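For completeness, an ordering-correct variant of the small-size path could use be64toh from glibc's <endian.h> (my sketch, GNU/BSD-specific; interpreting the temporaries as big-endian makes the integer comparison match lexical byte order, which the host-order comparison above does not guarantee):

```c
#define _DEFAULT_SOURCE     /* expose be64toh on glibc */
#include <endian.h>         /* be64toh: glibc/BSD extension, not ISO C */
#include <stdint.h>
#include <string.h>

/* Small-size path only: requires 1 <= n <= 8. The unused high bytes of
 * the temporaries stay zero, so shorter buffers compare as if padded
 * with trailing zero bytes on both sides. */
int memcmp_small(const void *s1, const void *s2, size_t n)
{
    uint64_t i1 = 0, i2 = 0;
    memcpy(&i1, s1, n);     /* aliasing- and alignment-safe loads */
    memcpy(&i2, s2, n);
    i1 = be64toh(i1);       /* first byte becomes most significant */
    i2 = be64toh(i2);
    return (i1 > i2) - (i1 < i2);
}
```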

Peter Cordes
Lundin
  • Lexical comparison needs to treat chunks as big-endian integers if you care about greater / less, not just equal/non-equal. Portable code should use GNU `endian.h` functions (https://man7.org/linux/man-pages/man3/endian.3.html), specifically `be64toh` – Peter Cordes Sep 01 '23 at 08:35
  • Also, it's not that great to `memcpy` 4 bytes into an 8-byte integer; your Godbolt link shows GCC actually storing `0x64636261` with a dword store to the stack, then reloading with a qword load that gets it and 4 bytes of uninitialized garbage, and causes a store-forwarding stall. GCC chooses to do a zero-extending 4-byte load from `data` (`s1`) for the 4-byte memcpy into 8-byte `uint64_t i1`, which is good. We still get the same good code-gen if we do `uint64_t i1=0;`. (Init for the other temporary results in an extra store to the stack when inlining into that caller.) Clang makes good asm – Peter Cordes Sep 01 '23 at 08:41
  • I'm afraid `i1` and `i2` must be initialized to `0` if `n < 8`. – chqrlie Sep 01 '23 at 12:03
  • Yeah my bad they need to be zeroed, fixed. It didn't matter much for the outcome however, clang gives the same output regardless. – Lundin Sep 01 '23 at 12:41
  • "didn't matter much for the outcome" is a weird way to put it. I think we both agree that there's a big difference between correct code and UB or implementation-defined behaviour that *happens* to compile to the asm we want in one case with one compiler, when not inlined into a larger function. Hopefully you just meant that you didn't have to rewrite the parts of your answer that discuss this compiling to efficient asm, if that's what you mean by "outcome". (Future readers: remember that in C, testing can't prove correctness. Code with UB can easily happen to work in one context.) – Peter Cordes Sep 02 '23 at 06:46
  • @PeterCordes clang produced identical machine code after the fix, only gcc changed its behavior. – Lundin Sep 02 '23 at 08:53
  • Yes, but my point is that the code might have been broken in exciting ways if clang inlined it into a different caller (e.g. with a non-power-of-2 `n`, or in a loop), so we don't actually know how it would behave in all cases with clang or gcc. (I forget if using the value of an uninitialized local is fully UB; I seem to recall no, at least under some conditions.) I mostly commented because that kind of phrasing could give beginners the wrong idea about it being ok to use code as long as it happens to work. – Peter Cordes Sep 02 '23 at 09:01