Given this code snippet:
#include <cstdint>
#include <cstddef>

struct Data {
    uint64_t a;
    //uint64_t b;
};

void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}
clang replaces the loop in foo with a memcpy call, just as expected (godbolt), giving the following -Rpass output:
example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
However, when I uncomment the second member uint64_t b; in Data, it doesn't do that anymore (godbolt). Is there a reason for this, or is this just a missed optimization? In the latter case, is there any trick to still make clang apply this optimization?
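For reference, this is what Data looks like with the second member uncommented (sizeof(Data) goes from 8 to 16):

struct Data {
    uint64_t a;
    uint64_t b;
};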
I noticed that if I change value to be of type Data& instead (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
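To make that concrete, here is a sketch of that variant (foo_ref is just an illustrative name; everything except the type of value is unchanged from foo above):

void foo_ref(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for(uint64_t i = 0; i < count_in; ++i) {
        // Bind a reference instead of making a local copy.
        Data& value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}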
Edit: Peter pointed out in the comments that the same thing happens with this simpler, less noisy function:
void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = data_in[i];
        *data_out++ = value;
    }
}
The question remains: Why is it not optimized?