My issue concerns deriving an unaligned __m512-sized vector pointer into a memory region of floats. I find that GCC and Clang are somewhat unstable in choosing the correct instruction (unaligned vs. aligned load) when accessing memory through such a contraption.
First, the working case:
typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
MyFloatVector* vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);
Both Clang and GCC generate MOVUPS for the above. However, if the type of vec_ptr is left to the compiler:
typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
auto vec_ptr = reinterpret_cast<MyFloatVector *>(float_ptr);
Something(*vec_ptr);
Now Clang generates MOVAPS, and a segfault follows down the line. GCC still generates MOVUPS, but also three do-nothing instructions (push rbp; mov rbp, rsp; pop rbp).
Also, if I change from typedef to using:
using MyFloatVector = float __attribute__((vector_size(64), aligned(4)));
MyFloatVector* vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);
Again, GCC emits the fluff instructions and Clang emits MOVAPS. Using auto here gives the same result.
So, does anyone have any idea what's happening under the hood, and is there a safe way to do the conversion? While a working combination exists, IMO the discrepancies between typedef/using and explicit/auto make it far too unreliable to use with confidence--at the minimum I'd need a static assert that dereferencing the pointer generates an unaligned access, which doesn't exist AFAIK.
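For completeness, the one approach I know of that is well-defined regardless of the typedef/using and explicit/auto combinations is going through memcpy, which both compilers fold into a single unaligned vector move (the helper names below are my own):

```cpp
#include <cstring>

typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));

// memcpy is defined for any alignment; GCC and Clang optimize these
// fixed-size copies into single unaligned vector loads/stores.
static inline MyFloatVector load_unaligned(const float* p) {
    MyFloatVector v;
    std::memcpy(&v, p, sizeof v);
    return v;
}

static inline void store_unaligned(float* p, const MyFloatVector& v) {
    std::memcpy(p, &v, sizeof v);
}
```

This sidesteps the question of what type the pointer actually has, but it only gives by-value access.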
In some cases I might want to have a MyFloatVector reference to the memory area, which rules out using intrinsics.
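Concretely, this is the kind of in-place use I have in mind (a sketch; Scale is just an illustrative name), which only works if the dereference is honoured as unaligned:

```cpp
typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));

// In-place update through a reference; the aligned(4) attribute should
// force the compiler to use unaligned loads/stores for this type.
void Scale(MyFloatVector& v, float s) {
    v *= s;
}

void ScaleRange(float* float_ptr, float s) {
    // The working combination: typedef + explicitly typed pointer.
    MyFloatVector* vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
    Scale(*vec_ptr, s);
}
```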
Sample code: https://godbolt.org/z/caxScz. Includes ICC for "fun", which generates MOVUPS throughout.