Given this code snippet:
#include <cstdint>
#include <cstddef>

struct Data {
    uint64_t a;
    //uint64_t b;
};

void foo(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}
clang replaces the loop in foo with a memcpy call, just as expected (godbolt), giving the following -Rpass output:
example.cpp:16:59: remark: Formed a call to llvm.memcpy.p0.p0.i64() intrinsic from load and store instruction in _Z3fooPvPmPSt4bytem function [-Rpass=loop-idiom]
static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
However, when I uncomment the second member uint64_t b; in Data, it doesn't do that anymore (godbolt). Is there a reason for this, or is this just a missed optimization? In the latter case, is there any trick to still make clang apply this optimization?
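For reference, this is what Data looks like with the second member uncommented (sizeof(Data) goes from 8 to 16):

struct Data {
    uint64_t a;
    uint64_t b;
};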
I noticed that if I change value to be of type Data& instead (i.e., remove the temporary local copy), the memcpy optimization is still applied (godbolt).
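To make that concrete, here is a sketch of that variant (foo_ref is just an illustrative name; everything except the type of value is unchanged from foo above):

void foo_ref(
    void* __restrict data_out,
    uint64_t* __restrict count_out,
    std::byte* __restrict data_in,
    uint64_t count_in)
{
    for(uint64_t i = 0; i < count_in; ++i) {
        // Bind a reference instead of making a local copy.
        Data& value = *reinterpret_cast<Data* __restrict>(data_in + sizeof(Data) * i);
        static_cast<Data* __restrict>(data_out)[(*count_out)++] = value;
    }
}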
Edit: Peter pointed out in the comments that the same thing happens with this simpler, less noisy function:
void bar(Data* __restrict data_out, Data* __restrict data_in, uint64_t count_in) {
    for(uint64_t i = 0; i < count_in; ++i) {
        Data value = data_in[i];
        *data_out++ = value;
    }
}
The question remains: Why is it not optimized?