Why does memcpy cause compilers to seemingly forget about strict aliasing?

Question

Consider the following C++ code:

std::uint32_t func(std::uint32_t* p1, std::uint64_t* p2) {
    *p2 = *p1;
    return *p1;
}

Compiling with -O3 yields the following disassembly on Clang (GCC is similar):

func(unsigned int*, unsigned long*):
        mov     eax, dword ptr [rdi]
        mov     qword ptr [rsi], rax
        ret

(Live demo: https://godbolt.org/z/vKPvT7o51)

Due to the strict aliasing rule, the compiler assumes p1 and p2 don't point to the same memory location, and thus there is no need to reload *p1 from memory for the return statement. So far so good, optimisations are working as I expect.

Now consider a similar piece of code, where the copy is done via memcpy():

std::uint32_t func(std::uint32_t* p1, std::uint64_t* p2) {
    std::memcpy(p2, p1, sizeof(std::uint32_t));
    return *p1;
}

Disassembly (identical on Clang and GCC):

func(unsigned int*, unsigned long*):
        mov     eax, dword ptr [rdi]
        mov     dword ptr [rsi], eax
        mov     eax, dword ptr [rdi]
        ret

(Live demo: https://godbolt.org/z/rszEEc1Gr)

This time, *p1 is reloaded from memory, as if writing via p2 could have modified the memory at p1 - i.e. p1 and p2 could be aliased.
I find this to be strange, because:

The compiler already knows that p1 and p2 do not alias due to the strict aliasing rule, as demonstrated in the first example.
(Even if the above point is untrue) memcpy() has the requirement that its arguments cannot alias, otherwise it's UB. I presume that compilers are aware of this restriction and take advantage of it for optimisation purposes (otherwise you'd end up with a slower memcpy()).

Therefore it seems as though the use of memcpy() is causing the compiler to forget about these aliasing restrictions.

What is going on here?

Is it simply a missed optimisation (or bug?) by the compiler?
Is it behaviour somehow required by the Standard?
Is it UB and thus we can't reason about the generated code?

On the point of UB, one line of possible reasoning is that the code is UB because the types of the source and destination are mismatched, which is likely not allowed (possibly violates the strict aliasing rule?). However, I do not believe this is true, because memcpy() copies at the byte level and it doesn't care what the types of its argument pointers are. p1 and p2 could just as well point to objects of the same type, but have been reinterpret_cast'd into different pointer types (which I believe is OK, as long as those pointers are not dereferenced, which they are not in this example).

One could then argue that if p1 and p2 could have been reinterpret_cast'd from any memory location, then they could alias, and thus the compiler must reload *p1 from memory. However, the restriction that memcpy()'s source and destination cannot overlap is still present, so the compiler should still know that p1 and p2 cannot alias.

`unsigned char*` can alias anything in ISO C++. `memcpy` copies memory as-if by accessing it as a `unsigned char*` (https://en.cppreference.com/w/cpp/string/byte/memcpy). Bypassing strict-aliasing restrictions is one of the major use-cases for `memcpy` in modern C++, especially before C++20 `std::bit_cast`. (In modern C++ you'd typically use `std::copy` for actually copying stuff.) — Peter Cordes, May 22 '22 at 05:15
@PeterCordes yes, I did note that, but I'm not sure it fully explains the results I'm seeing. Due to the aliasing restriction on the source and destination arguments, I assume no `memcpy()` implementation actually considers that they could alias, since that would be a pessimisation. The compiler also evidently has special understanding of how `memcpy()` works, since it often elides (or introduces) the call and does various other optimisation magic with it. So I'd expect that the compiler is also aware of the aliasing restrictions, hence at best I'd say my example is a missed optimisation. — MC ΔT, May 22 '22 at 05:28
Oh, interesting point. Yeah, I guess GCC/clang just treat it like `memmove` as far as optimizing. It's how you express an aliasing-safe unaligned load/store in ISO C++. BTW, you know your functions are not equivalent, right? Your second one doesn't zero-extend into the high half of the destination pointer, so it *only* writes the 32-bit low half. (Which is necessary for your argument about overlapping memcpy already being UB.) — Peter Cordes, May 22 '22 at 05:35
So yes, I think GCC/clang don't intentionally break code that assumes memcpy = memmove, because that's been a problem in enough real codebases... If GCC doesn't inline memcpy, glibc memcpy's implementation strategy for small copies (up to 63 bytes on machines with AVX) is to do 2 unaligned loads that reach the start/end of the region, then two stores. So that also happens to work as a memmove. I forget if current glibc memcpy actually checks for overlap on larger copies like memmove has to; it might just to combine those functions into one thing to stay hot in L1i. — Peter Cordes, May 22 '22 at 05:37
@PeterCordes If glibc's `memcpy()` and `memmove()` are equivalent, that's really interesting, I didn't know about that. W.r.t. my second example, yes I am aware that it's not equivalent, but I wasn't sure how to get equivalent examples without making the pointer types the same (which would sidestep any strict aliasing concerns). I'm not sure how it's relevant to my argument about UB though, could you elaborate? Any way I look at it, it's still copying between memory locations which can't alias or else it's technically UB. — MC ΔT, May 22 '22 at 05:54
@PeterCordes Note, if we change the `memcpy()` to copy all 64 bits - and perhaps imagine that `p1` points to an array of 2 `std::uint32_t`, the resulting disassembly is still similar. https://godbolt.org/z/f6KqcsPEW — MC ΔT, May 22 '22 at 05:56
Oh yes, that would be another case where the non-overlapping clause would kick in since read and write widths are the same. But it looks less weird. Also least less inefficient (no store-forwarding stall if the caller reloads the uint64_t right away.) — Peter Cordes, May 22 '22 at 06:09
@MCΔT Here's something interesting: gcc does not seem to propagate the information that the args of memcpy don't alias, and you have to write it out explicitly https://godbolt.org/z/rvxEnYvY5 — Artyer, May 22 '22 at 06:16
Gcc requires memcpy(p,p,n) (exact overlap) to be valid (it sometimes generates it from plain assignments), although partial overlap is allowed to break. — Marc Glisse, May 22 '22 at 06:17
The compilers I know consider that the type of a pointer is irrelevant as long as you do not dereference it, they only start looking at the type for aliasing purposes when they see a read or write through that pointer. In your second exemple, p2 is never dereferenced (a `char*` version of it may be dereferenced inside memcpy, but nothing that knows about uint64_t). — Marc Glisse, May 22 '22 at 06:22
@MarcGlisse adding dereferencing for `p2` doesn't change the assembler output in this example. — ixSci, May 22 '22 at 07:10
If you mean you added `*p2;` then that's just removed early. If you really use it (say `volatile auto x=*p2;`), then a large part of gcc's alias optimizations works at the level of statements (can this write alias this read), not so much at the level of variables, so adding an unrelated statement seldom helps. The easiest way to get the optimization here while keeping memcpy would be to add `__restrict` to p1 or p2. — Marc Glisse, May 22 '22 at 07:42
@MarcGlisse shouldn't aliasing be determined before the optimization block? I have no clue how exactly the modern compilers handle this but since it is a C++ term it would be fitting resolving all the C++ parts before going into the optimization part. — ixSci, May 22 '22 at 07:51
@ixSci: You need optimization to be able to prove cases when two pointers of the same type can't alias each other, including constant-propagation and tracking pointer values through things. e.g. if two pointers were derives from separate `new` statements, they can't point to the same object. Type-based aliasing is one useful way to rule out aliasing, but it's often possible (and helpful) to do better. Also, GCC doesn't really start thinking about what code means until it has an internal representation of it (GIMPLE) so the same optimizing middle-end can work for multiple language front-ends — Peter Cordes, May 22 '22 at 16:19
@PeterCordes but you don't need optimization to prove that pointers aren't aliasing in case of different types. Compiler might as well mark them `__restrict` right away and treat them that way. If it is implemented differently, fine. But I don't see why it can't be made that way because in that simple example of OP`s one would expect an optimization. — ixSci, May 22 '22 at 16:28
@ixSci: You need optimization enabled for that information to be of any use, though. But anyway, Marc Glisse already explained that in-place memcpy of something over itself is treated as valid, unlike partial overlap, because GCC will compile `structA = structB` into a memcpy if they're large enough. But assigning something to itself is always valid. So it would probably be more trouble than it's worth for GCC to keep track of whether it was a real memcpy in the source or one it invented. I expect the optimization proposed in this question would very rarely help real code. — Peter Cordes, May 22 '22 at 16:50
@ixSci: And no, `__restrict` wouldn't be accurate; that would mean that they couldn't alias with any other access from *any* pointer. But actually it's just this one pair of locations that are known to be disjoint, if you did want to treat `memcpy(same,same, size)` as UB. Maybe there's some way g++ could annotate that in the GIMPLE it's generating in the first place, IDK. If so, then it might not be too hard to implement this optimization if anyone wanted to do it. — Peter Cordes, May 22 '22 at 16:54
"The compiler already knows that p1 and p2 do not alias" is an oversimplification of the strict aliasing rule. A slightly more precise statement would be that you may not access the same memory through pointers of types `uint32_t *` and `uint64_t *`. But casting and conversion can create a pointer of another type that points to the same memory. In the case of `memcpy` you end up with an `unsigned char *` and you *are* allowed to access the same memory through such a pointer, as a special exception to the strict aliasing rule. — Nate Eldredge, May 22 '22 at 18:40
So strict aliasing is a red herring here. If you used `memmove` instead, then nothing would have changed from a strict aliasing perspective, but the code would be perfectly legal (albeit with an implementation-defined result) even if `p1` and `p2` do alias, and in particular, the compiler would be obligated to reload `*p1` or behave as if it did so. — Nate Eldredge, May 22 '22 at 18:41
@NateEldredge: The question buries the key idea, but it's that using `memcpy` instead of `memmove` intentionally makes it UB if `*p1` and `*p2` overlap. (Not via type-based aliasing but via memcpy's special rule that the input can't alias the output, so you're right that *strict* aliasing is a red herring). That in theory lets the compiler optimize as if they don't. The question is why it chooses not to do that, instead treating it like you'd used `memmove`. — Peter Cordes, May 22 '22 at 19:36
@PeterCordes: Yes, agreed. It just seems to me that "assume p1, p2 don't alias because they are of different types" is a rather different deduction than "assume p1, p2 don't alias because they were both passed to `memcpy`". I wouldn't particularly expect that a compiler capable of doing #1 would therefore also do #2, and so I don't find it surprising that the optimization occurs in #1 but not #2. They would have to implement #2 separately, and like you said, it's probably rare enough that it wouldn't have much benefit. — Nate Eldredge, May 22 '22 at 19:50
Also to nitpick: given `memcpy(p1, p2, size)`, in order to validly conclude `p1, p2` don't alias, the compiler must also prove that `size` cannot be zero. I think that `memcpy(ptr, ptr, 0)` is well defined (and does nothing). In this case `size == sizeof(uint32_t)` which is indeed nonzero, but it is just one more hoop that the compiler would have to jump through in order to do this optimization, and further reduces the number of cases where it would be applicable. — Nate Eldredge, May 22 '22 at 19:53
Come to think of it, consider `uint64_t *p1; uint32_t *p2 = (uint32_t *)p1 + 1 ; memcpy(p2, p1, 4)`. Then the `memcpy` is perfectly legal as the blocks do not overlap, but it does result in `*p1` being modified, and so the compiler must not optimize out a reload of `*p1`. — Nate Eldredge, May 22 '22 at 19:58
@NateEldredge: Right, yes, 100% agreed, good way of putting it in terms of being a separate optimization that compilers would have to look for separately. Not in terms of "breaking" strict-alias optimizations; working around strict-aliasing is one of the most important features of `memcpy` when trying to write C++ that compiles to "interesting" asm. — Peter Cordes, May 22 '22 at 20:02
@NateEldredge I'm not sure your last example is valid. `memcpy` talks about objects not memory blocks and your _objects_ overlap. Also here is the declaration of `memcpy` from C18: `void *memcpy(void * restrict s1, const void * restrict s2, size_t n);` which your example also violates. Yes we have C++ here but the intent is clear for both languages. — ixSci, May 23 '22 at 05:45
@NateEldredge Interesting points that you have raised. On your point about this being a rare scenario, I feel like maybe it's not that rare. It seems to me that it's generalisable to: if an object is used as the source argument to `memcpy()`, then subsequent usages of that object must be reloaded from memory. That seems like a common-enough and not-insignificant pessimisation, if true. — MC ΔT, May 23 '22 at 07:39
@MCΔT: I think the way I would put it is this. By default, *any* call to `memcpy` forces *every* object to be reloaded from memory, unless proved otherwise. This is because it effectively dereferences an `unsigned char *`, which under the strict aliasing rule is allowed to alias anything. The point you raise is that there is one exception: namely, the object used as the source does *not* actually have to be reloaded, because if it did overlap the destination it would be UB. And so the compiler could optimize out the reload in that one special case. — Nate Eldredge, May 23 '22 at 17:55
But that case is indeed *special* and quite narrow, and that's why I'm not surprised that the optimization isn't done. — Nate Eldredge, May 23 '22 at 17:55
@ixSci: I asked that as a [new question](https://stackoverflow.com/questions/72356568/can-you-memcpy-between-non-overlapping-regions-of-the-same-object). — Nate Eldredge, May 24 '22 at 02:22
By "overlap the destination" you mean if `p2` holds an address that is exactly one higher than `p1` and you're copying more than one byte, thereby effectively performing a `memset` with the value being the byte stored at address `p1`? — puppydrum64, Dec 15 '22 at 13:53

Why does memcpy cause compilers to seemingly forget about strict aliasing?

0 Answers0

Linked