Why don't modern compilers coalesce neighboring memory accesses?

Question

Consider the following code:

bool AllZeroes(const char buf[4])
{
    return buf[0] == 0 &&
           buf[1] == 0 &&
           buf[2] == 0 &&
           buf[3] == 0;
}

Output assembly from Clang 13 with -O3:

AllZeroes(char const*):                        # @AllZeroes(char const*)
        cmp     byte ptr [rdi], 0
        je      .LBB0_2
        xor     eax, eax
        ret
.LBB0_2:
        cmp     byte ptr [rdi + 1], 0
        je      .LBB0_4
        xor     eax, eax
        ret
.LBB0_4:
        cmp     byte ptr [rdi + 2], 0
        je      .LBB0_6
        xor     eax, eax
        ret
.LBB0_6:
        cmp     byte ptr [rdi + 3], 0
        sete    al
        ret

Each byte is compared individually, but it could've been optimized into a single 32-bit int comparison:

bool AllZeroes(const char buf[4])
{
    return *(int*)buf == 0;
}

Resulting in:

AllZeroes2(char const*):                      # @AllZeroes2(char const*)
        cmp     dword ptr [rdi], 0
        sete    al
        ret

I've also checked GCC and MSVC, and neither of them does this optimization. Is this disallowed by the C++ specification?

Edit: Changing the short-circuited AND (&&) to bitwise AND (&) will generate the optimized code. Also, changing the order the bytes are compared doesn't affect the code gen: https://godbolt.org/z/Y7TcG93sP

That little data so close in memory is probably going to end up in the same cache-line. Memory access isn't going to be that much of an issue. — Some programmer dude, Nov 25 '21 at 11:51
Because the compiler does not know the size of the array and `&&` is short-circuit evaluation. The array indexes greater than `0` may only be valid if `buf[0] == 0` is `true`. Evaluating `buf[1] == 0 &&` may be UB if the first test is `false` — Richard Critten, Nov 25 '21 at 11:51
@Quentin Doesn't look so, forcing alignment doesn't change a thing: https://godbolt.org/z/e7fvjEoEe — , Nov 25 '21 at 11:52
@RichardCritten isn't the array size explicit in the parameter? Or is that not part of the signature? — perivesta, Nov 25 '21 at 11:54
@dave no, the array size in a function parameter is only a comment for the developer. `const char buf[4]` is exactly the same as `const char *buf` as function parameter. — mch, Nov 25 '21 at 11:55
It collapses to `const char *` see live - https://godbolt.org/z/n11aEY9f7 — Richard Critten, Nov 25 '21 at 11:56
@RichardCritten That makes sense, changing `&&` to bitwise AND will generate the optimized code. I've also tried comparing the 3rd byte first but again, no luck: https://godbolt.org/z/Y7TcG93sP — , Nov 25 '21 at 11:59
related: https://stackoverflow.com/questions/48929382/can-the-compiler-jit-optimize-away-short-circuit-evaluation-if-there-are-no-side — OrenIshShalom, Nov 25 '21 at 12:18
Passing the array by reference doesn't optimize the code even though the size is known. The short circuit has priority. https://godbolt.org/z/r73czsdMb — stefaanv, Nov 25 '21 at 12:23
`return *(int*)buf == 0;` is technically UB unless the passed in `buf` really does point at a `int`. Casting any pointer to `char *` and dereferencing is OK but casting `char *` to `int *` and dereferencing is only ok if the `char *` was originally pointing to an `int`. [note: all the `const`s removed for clarity] — Richard Critten, Nov 25 '21 at 17:26
`*(int*)buf` is strict-aliasing UB. Use `const char zero[4] = {}; memcmp(buf, zero, sizeof(buf));`, @RichardCritten. (Or if you want to make assumptions about 1-byte char and so on, use `int32_t zero = 0;`, since GCC has a missed optimization with a local array.) https://godbolt.org/z/8rPE43WaT shows this compiling to `cmp dword ptr [rsp+4], 0` after inlining `memcmp`. (Compiling for some ISAs without efficient unaligned loads won't inline memcmp if you don't have an alignment guarantee, though. Like MIPS.) — Peter Cordes, Nov 26 '21 at 02:58
And BTW, in general GCC8 and later *is* capable of some load and store coalescing, fixing the gcc2.95 -> gcc3 regression from decades before! So finally code like `a[0] | (a[1] << 8} | ...` can actually compile to a single load on a little-endian machine. — Peter Cordes, Nov 26 '21 at 02:59
Clang gets it right in C99 with `_Bool AllZeroes1(const char buf[static 4])` , while gcc doesn't: gcc https://godbolt.org/z/56TvbEEhe : clang: https://godbolt.org/z/Grqs3En3K (thanks to Caze @libera #C ) — hanshenrik, Nov 26 '21 at 19:17
If you mean a 32bit integer, use `uint32_t` (unsigned) or `int32_t` (signed). An `int` might be 16bit or 32bit or 64bit or something completely different. — ndim, Nov 27 '21 at 01:26
@RichardCritten Re: *"Because the compiler does not know the size of the array and && is short-circuit evaluation. The array indexes greater than 0 may only be valid if buf[0] == 0 is true":* On this specific target architecture accessing an uninitialized memory location is always OK, so the UB from the standard simply doesn't apply. And the size of the array appears irrelevant. (Of course, this machine-specific optimization may simply not have been implemented because that would be too much effort -- in the general case your objection is valid.) — Peter - Reinstate Monica, Nov 27 '21 at 09:33
@Peter-ReinstateMonica edge case - assume that buf[0] is the last address in a memory page. Then buf[1] and higher would be in the next page __but__ this page has not been allocated. So accessing the address as a 4 byte region would segfault. So I believe the code generator/optimiser has to play it safe. — Richard Critten, Nov 27 '21 at 09:46
@RichardCritten Shouldn't that prevent the `&` case as well? — Peter - Reinstate Monica, Nov 27 '21 at 10:10
Another question is whether the function gets optimized when it is visible (inline or file static): The compiler would then have all information to safely optimize more. — Peter - Reinstate Monica, Nov 27 '21 at 10:23
@Peter-ReinstateMonica The `&` case we are telling the compiler - go ahead evaluate everything (no short circuit) I guarantee that it's safe to do so (as good programmers we would never put UB in our code). — Richard Critten, Nov 27 '21 at 10:24
@RichardCritten How does the "I know the addresses are all valid" argument not apply to &&? page alignment doesn't depend on the operator (race conditions may though, as has been pointed out). — Peter - Reinstate Monica, Nov 27 '21 at 10:32
The discussion is somewhat academic (duh) if the function is inlined, see https://godbolt.org/z/WYv5MPc1d: cmp dword... — Peter - Reinstate Monica, Nov 27 '21 at 10:33
@Peter-ReinstateMonica Operator `&&` requires short circuit evaluation. The compiler can only apply the as-if-rule and allow the optimiser to evaluate all the arguments in the expression if it can prove it is safe to do so (as you say this could happen in the inlined case). The end-of-page/page-fault example I think disallows this application of the as-if-rule as a segfault is a noticeable difference. — Richard Critten, Nov 27 '21 at 10:41

score 73 · Answer 1 · edited Nov 25 '21 at 12:24

73

If buf[0] is nonzero, the code will not access buf[1]. So the function should return false without checking the other buf elements. If buf is close to the end of the last memory page, buf[1] may trigger an access fault. The compiler should be very careful to not read stuff which may be forbidden to read.

edited Nov 25 '21 at 12:24

bolov

72,283
15
145
224

answered Nov 25 '21 at 12:04

anatolyg

26,506
9
60
134

4

That's a good reason for the example in question. Furthermore, even if there is no access fault, there could be arbitrary memory in the accessible memory as well. However, the compiler appears to not optimise even when the size of the array is known: https://godbolt.org/z/ch41zd5Wc (edit: also added alignment to see if that helps; it does not) – eerorika Nov 25 '21 at 12:06
4

A very clever compiler might (in principle) clone the function when `buf` is word aligned, e.g. with whole program optimization – Basile Starynkevitch Nov 25 '21 at 12:06
it can be just `reinterpret_cast`'ed stuff from something that you never seen. without any mem bounds access to it could ruin say OS haha – Алексей Неудачин Nov 25 '21 at 12:36
2

@eerorika https://godbolt.org/z/dfnhhx7oG – n. m. could be an AI Nov 25 '21 at 13:50
2

@eerorika since the array in your compiler-explorer example isn't initialized and the called function may not initialize all elements, accessing may result in UB so the `&&` short circuit can't be short circuited by a single read without possible UB – doug Nov 25 '21 at 18:19
@doug Good catch. Initialisation isn't sufficient to make clang optimise though: https://godbolt.org/z/vooh99TeT – eerorika Nov 25 '21 at 19:21
wrong statement. if `arr[0]` is zero, not non-zero – Алексей Неудачин Nov 25 '21 at 20:54
@АлексейНеудачин: No, the answer as written is correct. If `buf[0]` is nonzero, then `buf[0] == 0` is false, and the `&&` chain stops there. – user2357112 Nov 26 '21 at 01:48
@eerorika: Your local-array example in https://godbolt.org/z/ch41zd5Wc is just a clang missed-optimization; it ideally would compile that to `cmp dword ptr [rsp+4], 0` / `setz al`. GCC doesn't manage that either, but does unconditionally `OR` all 4 bytes together and `sete al` based on the FLAGS result from the last `or`, i.e. whether there were any non-zero bits in any of the four bytes: https://godbolt.org/z/K8Ks4xxEM – Peter Cordes Nov 26 '21 at 02:44
8

@doug: x86 has no trap representations for integers, so no, compilers don't need to avoid doing things in asm for the specific target just because they would be UB in portable C++, as long as the as-if rule is satisfied for all cases where the C++ abstract machine doesn't encounter UB. Speculative reads are allowed if they're from locations that definitely can't fault (on targets without hardware race detection, i.e. normal CPUs). `pretend_modify` could have passed the address to another thread which is currently writing the last byte, no UB unless the first 3 are zero. – Peter Cordes Nov 26 '21 at 02:50
But again that's fine: the whole dword will compare not equal to zero if there are any set bits in the first 3 bytes, regardless of what's in the last byte. So the value of that byte is only relevant if the C++ abstract machine would have read it. – Peter Cordes Nov 26 '21 at 02:51
I wonder if the compiler _would_ decide to optimize if the highest index would come first: `return buf[3] == 0 && buf[2] == 0 && buf[1] == 0 && buf[0] == 0;` Or should the compiler also take into account the possibility of `buf` being a 'negative' address? – Ruud Helderman Nov 26 '21 at 09:41
I had hoped that passing the array by reference ([`bool AllZeroes(const char (&buf)[4])`](https://godbolt.org/z/fvErcP6GK)) might make a difference (as the compiler now knows all elements are valid), but I see no change. – Toby Speight Nov 26 '21 at 10:52
@RuudHelderman If the highest address was checked first, it would *in principle* know all addresses can be read. Thus, the only problem is mis-alignment, depending on the underlying architecture (no problem for x86). – Deduplicator Nov 26 '21 at 11:33
OP's code is specifically saying it's `a pointer to 4x chars` , if it was just `const char *buf` i would agree with you, but `const char buf[4]` sounds like a guarantee-from-the-programmer-to-the-compiler that its pointing to 4 bytes, not 1, or 2, or 3 – hanshenrik Nov 26 '21 at 12:36
@hanshenrik You are mis-reading what the code means. The prototype `bool AllZeroes(const char buf[4])` is read as `bool AllZeroes(const char* buf)`. – Deduplicator Nov 26 '21 at 13:40
@Deduplicator In all fairness that's how the language sees it, but the compiler could assume that the coder specified the length on purpose. (Is there a provision in the standard that forbids declaring a wrong -- in particular a too long -- length?) – Peter - Reinstate Monica Nov 27 '21 at 10:15
1

@Peter-ReinstateMonica I gave you exactly how the compiler reads it. C did retrofit the option `bool AllZeroes(const char buf[static 4])` for a [guaranteed minimal length in C99](//en.cppreference.com/w/c/language/array), but C++ didn't pick it up. – Deduplicator Nov 27 '21 at 10:41

score 25 · Answer 2 · edited Nov 29 '21 at 07:11

25

The first thing to understand is that f(const char buf[4]) does not guarantee that the pointer points to 4 elements, it means exactly the same as const char *buf, the 4 is completely ignored by the language. (C99 has a solution to this, but it's not supported in C++, more on that below)

Given AllZeroes(memset(malloc(1),~0,1)), the implementation

bool AllZeroes(const char buf[4])
{
    return buf[0] == 0 &&
           buf[1] == 0 &&
           buf[2] == 0 &&
           buf[3] == 0;
}

should work, because it never tries to read byte #2 (which doesn't exist) when it notices that byte #1 is non-zero, while the implementation

bool AllZeroes(const int32_t *buf)
{
    return (*buf == 0);
}

should segfault as it tries to read the first 4 bytes while only 1 byte exists (malloced 1 byte only)

FWIW Clang gets it right (and GCC doesn't) in C99 with the implementation

_Bool AllZeroes(const char buf[static 4])
{
    return buf[0] == 0 &
           buf[1] == 0 &
           buf[2] == 0 &
           buf[3] == 0;
}

which compiles to the same as

_Bool AllZeroes(const int32_t *buf)
{
    return (*buf == 0);
}

see https://godbolt.org/z/Grqs3En3K (thanks to Caze @libera #C for finding that)

unfortunately buf[static 4], which in C99 is a guarantee-from-the-programmer-to-the-compiler that the pointer points to minimum 4 elements, is not supported in C++

edited Nov 29 '21 at 07:11

Cortex0101

831
2
12
28

answered Nov 26 '21 at 19:37

hanshenrik

19,904
4
43
89

6

Generally agree with this answer, but "should segfault" really isn't a thing. Malloc doesn't make any gaurantees about the addresses outside of the allocation it doesn't gaurantee they are accessible but it doesn't gaurantee they are inaccessible either. – plugwash Nov 26 '21 at 21:44
4

`const std::array *buf` - pretty sure that accessing any member of a class object implies that an entire object of that class type is present and fully accessible, even if the only member is an array. e.g. that it would be UB to pass a pointer to the last byte of a page as such an arg. (Another thread might be writing one of the members, but that's fine in asm). [Allocating memory for a part of structure](//stackoverflow.com/q/53904602) is about C, not C++. I think there was a C++ Q&A about this at some point in the past year or two, but I didn't find it with some searching. – Peter Cordes Nov 27 '21 at 03:16
@PeterCordes yeah i'm pretty sure you're right, i removed the std::array section, thanks – hanshenrik Nov 27 '21 at 07:00
@plugwash you're right, should explain it differently, but i'm not sure how exactly, you're free to edit it ^^ – hanshenrik Nov 27 '21 at 07:16
2

I thought that section was actually pretty relevant. It's strict-aliasing UB to point a `std::array` pointer at an object that isn't a `std::array` (although it probably works in practice), but re-designing the function to take a `const std::array` by pointer or reference instead of a raw `const char*` is something one might want to do, showing one indirect advantage of using container classes instead of C-style arrays. It's C++'s answer to `[static 4]`. (In this case it's small enough that even taking it by value might be even better; having the 4 bytes in one register arg is perfect) – Peter Cordes Nov 27 '21 at 07:17
@PeterCordes neat, you're free to add std::array info, but i won't do that personally because honestly, i know uncomfortably little about the subject – hanshenrik Nov 27 '21 at 08:00
3

What if instead of the aliasing cast to `std::array*` you used an input of type `std::span`? According to cppreference.com it's UB to try to construct a span of static positive extent pointing to an invalid range; so technically, I guess a compiler could coalesce accesses to such a span. (Though as far as I know, there would have to be compiler-specific annotations of some kind in the standard library headers to be able to implement such an optimization -- or make `std::span` alias or wrap a built-in type, etc.) – Daniel Schepler Nov 27 '21 at 18:00
It seems, replacing `const char buf[static 4]` with `const char *buf` doesn't prevent the optimization. – HolyBlackCat Nov 28 '21 at 13:23

score 14 · Answer 3 · edited Nov 26 '21 at 16:02

14

There's the short-circuit evaluation thing. So it can't be optimized as you think. If buf[0] == 0 is false buf[1] == 0 must not be checked. It can be UB or something forbidden to use or whatever - this all must still work.

https://en.wikipedia.org/wiki/Short-circuit_evaluation

edited Nov 26 '21 at 16:02

Jens

69,818
15
125
179

answered Nov 25 '21 at 12:31

Алексей Неудачин

405
4
13

5

Re: "it can be ub": Undefined behavior is a property of C++ code, not a property of compiler-generated assembly code. It's 100% fine for a C++ compiler to perform an optimization that would be "undefined behavior" if you wrote it in C++, as long as the compiler knows that the generated assembly will be safe on the systems that the generated code supports. – ruakh Nov 26 '21 at 06:43
2

@ruakh: Right, it's not ISO C abstract-machine UB that's the problem, it's stuff that might fault on actual x86. In this case, it's unmapped pages. Without an alignment guarantee on the pointer, `arr[3]` could be in the next page, which might be unmapped. It's legal (no C UB) to pass this function a pointer to the last byte of a a page, if the byte is non-zero. (It's the `strlen` problem in reverse: [vectorized strlen getting away with reading unallocated memory](https://stackoverflow.com/q/25566302) and [this](//stackoverflow.com/q/37800739)) – Peter Cordes Nov 26 '21 at 11:12
2

(In another comment thread, [I pointed out](https://stackoverflow.com/questions/70110562/why-dont-modern-compilers-coalesce-neighboring-memory-accesses/70123361#comment123951270_70110802) that C data-race UB is specifically not a problem on x86, or other normal ISAs, because speculative reads are well defined as long as you don't care about the value.) – Peter Cordes Nov 26 '21 at 11:15

Why don't modern compilers coalesce neighboring memory accesses?

3 Answers3