C++17 std::byte produces less optimized code with the standard algorithms in GCC

Question

I really like std::byte as a distinct type that implements the concept of byte as specified in the C++ language definition. What I don't like is the fact that modern C++ compilers will produce less optimized code using the standard algorithms.

Here I'm playing with a function that checks the first 4 bytes in the header, you can follow my snippets on Godbolt

bool func_bytes(const std::array<std::byte, 1024>& buf) {
    constexpr std::array<std::byte, 4> header {
        std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
    };
    return std::equal(header.begin(), header.end(), buf.begin());
}

This will produce the following assembly on x86-64 gcc trunk

func_bytes(std::array<std::byte, 1024ul> const&):
        cmp     BYTE PTR [rdi], -34
        jne     .L5
        cmp     BYTE PTR [rdi+1], -83
        jne     .L5
        cmp     BYTE PTR [rdi+2], -66
        jne     .L5
        cmp     BYTE PTR [rdi+3], -81
        sete    al
        ret
.L5:
        xor     eax, eax
        ret

If I replace the std::byte with unsigned char, then compiler will optimize to just an dword comparison.

bool func_chars(const std::array<unsigned char, 1024>& buf) {
    constexpr std::array<unsigned char, 4> header {0xDE, 0xAD, 0xBE, 0xAF};
    return std::equal(header.begin(), header.end(), buf.begin());
}

Here is the assembly produced

.LC0:
        .string "\336\255\276\257"
func_chars(std::array<unsigned char, 1024ul> const&):
        mov     eax, DWORD PTR [rdi]
        cmp     DWORD PTR .LC0[rip], eax
        sete    al
        ret

My solution to optimized std::byte version is to use old friends memcmp() and memcpy(), which is translated into the compiler's builtin.

bool func_bytes_memcmp(const std::array<std::byte, 1024>& buf) {
    constexpr std::array<std::byte, 4> header {
        std::byte{0xDE}, std::byte{0xAD}, std::byte{0xBE}, std::byte{0xAF}
    };
    return 0==std::memcmp(header.data(), buf.data(), header.size());
}

which produces the smallest code!

func_bytes_memcmp(std::array<std::byte, 1024ul> const&):
        cmp     DWORD PTR [rdi], -1346458146
        sete    al
        ret

Is this the modern C++ approach?

MSVC++ has no trouble optimizing this code, one way to poke them in the eye about it. Do actually profile it first, a modern x86_64 core has 4 integer execution units so these compare+jumps can execute concurrently. — Hans Passant, Jan 14 '23 at 22:14
@HansPassant: On Intel at least, only 2 execution units can handle branches, and only one can handle predicted-taken branches. A single dword compare is obviously much better. (Although it sucks that it didn't use an immediate constant; loading 4 bytes from a 5-byte `.string` has no advantages. And it's weird that GCC invented a terminating `0` for `std::array header {0xDE, 0xAD, 0xBE, 0xAF};`) — Peter Cordes, Jan 16 '23 at 15:04

user17732522 · Answer 1 · 2023-01-14T22:17:21.777

That's a quality issue with the implementation of std::equal in the standard library, not with the compiler itself (I think).

Note that on compiler explorer by default both GCC and Clang use libstdc++ as the standard library implementation.

It has a generic implementation of std::equal which uses a loop iterating linearly through the range and short-circuits if comparison of an element fails. When the compiler compiles this loop, it can't replace the individual byte access with single comparison because it would break the short-circuit, potentially performing a load on a byte it wasn't supposed to.

I am not entirely sure whether there is actually an architectural justification to not invent loads though or whether this is simply not done because the performance benefit is ambiguous. As far as I can see there isn't any issue with inventing a load given that we know the address to be accessible (since it is part of a C++ object in its lifetime). But that high-level information might not make it until the point that these optimizations are made.

There is also a specialization of std::equal which forwards to std::memcmp in case the value type is "simple", but that currently considers only integers and pointers. It should include std::byte.

This has already been reported on GCC's bug tracker here but doesn't seem to have received any response. Nonetheless I think this should be easy to fix and I don't think there is any problem with also considering std::byte simple for that purpose.

Note that other standard library implementations may not have the memcmp optimization for any types. E.g. with libc++ Clang produces the non-optimal code with both std::byte and unsigned char.

Btw. you can further improve the compiler output by making the header array static. It will then produce identical output to the memcmp version. I don't actually see any particular reason that it loads the string instead of using an immediate without static, but without static each call to the function ought to have a different header instance with a different address. At least before inlining that requires making the copy, since a pointer into the array is formed and the compiler wouldn't be sure whether std::equal wouldn't e.g. compare pointers to different instances of header.

As a workaround I would write my own implementation of std::equal that forwards to memcmp in more situations than the actual standard library implementation does. And then I would also make my constants static constexpr to help out the compiler a bit. But if this is a one-off situation I don't see anything wrong with your approach of using memcmp directly.

note that for std::equal you may want to call __memcmpeq if you have it: https://codebrowser.dev/glibc/glibc/string/string.h.html#79 (despite what the note says I couldn't find a compiler that is able to perform the transformation automatically yet) — Jean-Michaël Celerier, Jul 12 '23 at 20:24

C++17 std::byte produces less optimized code with the standard algorithms in GCC

1 Answers1