GCC and Clang are known to auto-vectorize loops well using SIMD instructions.
The standard C++ alignas() specifier can, among other uses, align a stack variable. For example, the following code:
#include <cstdint>
#include <iostream>
int main() {
    alignas(1024) int x[3] = {1, 2, 3};
    alignas(1024) int (&y)[3] = *(&x);
    std::cout << uint64_t(&x) % 1024 << " "
              << uint64_t(&x) % 16384 << std::endl;
    std::cout << uint64_t(&y) % 1024 << " "
              << uint64_t(&y) % 16384 << std::endl;
}
Outputs:
0 9216
0 9216
which means that both x and y are aligned on the stack to 1024 bytes, but not to 16384 bytes.
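The modulo check above can be wrapped in a small helper. This is only a minimal sketch; is_aligned and max_alignment are illustrative names, not part of any library. Since 9216 = 9 × 1024, the address printed in that run is a multiple of 1024 but not of 2048, so max_alignment would report 1024 for it.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper: true if `p` is a multiple of `alignment`.
// `alignment` must be a non-zero power of two.
inline bool is_aligned(const void* p, std::size_t alignment) {
    assert(alignment != 0 && (alignment & (alignment - 1)) == 0);
    return (reinterpret_cast<std::uintptr_t>(p) & (alignment - 1)) == 0;
}

// Hypothetical helper: largest power-of-two alignment of `p`,
// capped at `limit` (itself a power of two).
inline std::size_t max_alignment(const void* p, std::size_t limit) {
    std::size_t a = 1;
    while (a < limit && is_aligned(p, a * 2))
        a *= 2;
    return a;
}
```

For the array x from the snippet above, is_aligned(&x, 1024) would be expected to return true.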
Now consider another piece of code:
#include <cstdint>
void f(uint64_t * x, uint64_t * y) {
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
When compiled with GCC using the flags -std=c++20 -O3 -mavx512f, it produces the following assembly (only the relevant part is shown):
vmovdqu64 zmm1, ZMMWORD PTR [rdi]
vpxorq zmm0, zmm1, ZMMWORD PTR [rsi]
vmovdqu64 ZMMWORD PTR [rdi], zmm0
vmovdqu64 zmm0, ZMMWORD PTR [rsi+64]
vpxorq zmm0, zmm0, ZMMWORD PTR [rdi+64]
vmovdqu64 ZMMWORD PTR [rdi+64], zmm0
which performs an unaligned AVX-512 load, an XOR, and an unaligned store, twice over. So the 64-bit array XOR was auto-vectorized by GCC to use AVX-512 registers, and the loop was fully unrolled as well.
My question is how to tell GCC that the pointers x and y passed to the function are both aligned to 64 bytes, so that instead of the unaligned loads (vmovdqu64) in the code above, GCC uses aligned loads (vmovdqa64). Aligned loads and stores can be considerably faster, especially with 64-byte AVX-512 vectors, where every misaligned access crosses a cache-line boundary.
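One approach that may be worth trying on GCC and Clang is the __builtin_assume_aligned builtin, which returns its argument together with a promise that it has the given alignment. The sketch below applies it to the loop from f; the name f_hinted is made up for illustration. Whether the compiler then emits vmovdqa64 depends on the compiler version and flags, so the generated assembly should be inspected. If the promise is false, the behavior is undefined.

```cpp
#include <cassert>
#include <cstdint>

// Sketch: promise the compiler that x and y are 64-byte aligned.
// __builtin_assume_aligned is a GCC/Clang builtin, not standard C++.
void f_hinted(std::uint64_t* x, const std::uint64_t* y) {
    // Debug-only check of the promise; compiled out with -DNDEBUG.
    assert(reinterpret_cast<std::uintptr_t>(x) % 64 == 0);
    assert(reinterpret_cast<std::uintptr_t>(y) % 64 == 0);

    // The builtin returns void*, so cast back to the element type.
    auto* xa = static_cast<std::uint64_t*>(
        __builtin_assume_aligned(x, 64));
    auto* ya = static_cast<const std::uint64_t*>(
        __builtin_assume_aligned(y, 64));

    for (int i = 0; i < 16; ++i)
        xa[i] ^= ya[i];
}
```

The caller has to guarantee the alignment, for example by declaring the arrays as alignas(64) uint64_t a[16];.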
My first attempt to make GCC use aligned loads and stores was the following code:
#include <cstdint>
void g(uint64_t (&x_)[16],
       uint64_t const (&y_)[16]) {
    alignas(64) uint64_t (&x)[16] = x_;
    alignas(64) uint64_t const (&y)[16] = y_;
    for (int i = 0; i < 16; ++i)
        x[i] ^= y[i];
}
but this code still produces unaligned loads (vmovdqu64), the same as in the assembly of the previous snippet. Hence the alignas(64) hint does nothing useful to improve the assembly GCC generates.
My question is: how do I make GCC auto-vectorize with aligned loads and stores, short of manually writing SIMD intrinsics such as _mm512_load_epi64() for every operation?
If possible, I need solutions for all of GCC, Clang, and MSVC.
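As a possible starting point for a portable version, C++20 provides std::assume_aligned in <memory>. The sketch below wraps it, falling back to __builtin_assume_aligned on GCC/Clang before C++20 and to a no-op otherwise. The names assume_aligned_ptr and f_portable are hypothetical, and whether MSVC's optimizer acts on the hint is an assumption that would need checking against its generated assembly.

```cpp
#include <cstddef>
#include <cstdint>
#include <memory>  // std::assume_aligned (C++20)

// Hypothetical wrapper: returns p, hinting that it is N-byte aligned.
// Passing a pointer that is not N-byte aligned is undefined behavior.
template <std::size_t N, typename T>
inline T* assume_aligned_ptr(T* p) {
#if defined(__cpp_lib_assume_aligned)
    return std::assume_aligned<N>(p);                        // C++20 library
#elif defined(__GNUC__) || defined(__clang__)
    return static_cast<T*>(__builtin_assume_aligned(p, N));  // GCC/Clang builtin
#else
    return p;  // no hint available; still correct, just unhinted
#endif
}

// Same loop as f, with both pointers hinted as 64-byte aligned.
void f_portable(std::uint64_t* x, const std::uint64_t* y) {
    std::uint64_t* xa = assume_aligned_ptr<64>(x);
    const std::uint64_t* ya = assume_aligned_ptr<64>(y);
    for (int i = 0; i < 16; ++i)
        xa[i] ^= ya[i];
}
```

Keeping the hint in one wrapper means the compiler-specific part lives in a single place, and the loop itself stays plain C++ that each compiler can auto-vectorize as it sees fit.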