
My issue concerns deriving an unaligned __m512 pointer into memory that holds floats. I find that GCC and Clang are inconsistent about generating the correct instruction (unaligned vs. aligned load) when accessing memory through such a contraption.

First, the working case:

typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
MyFloatVector* vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);

Both Clang and GCC generate MOVUPS for the above. However, if the type for vec_ptr is left for the compiler:

typedef float MyFloatVector __attribute__((vector_size(64), aligned(4)));
auto vec_ptr = reinterpret_cast<MyFloatVector *>(float_ptr);
Something(*vec_ptr);

Now, Clang will generate MOVAPS and a segfault down the line. GCC will still generate MOVUPS, but also three do-nothing instructions (push rbp, load rsp to rbp, pop rbp).

Also, if I change from typedef to using:

using MyFloatVector = float __attribute__((vector_size(64), aligned(4)));
MyFloatVector*vec_ptr = reinterpret_cast<MyFloatVector*>(float_ptr);
Something(*vec_ptr);

Again GCC generates the fluff instructions and Clang generates MOVAPS. Using auto here gives the same result.

So, does anyone have any idea what's happening under the hood, and is there a safe way to do the conversion? While a working solution exists, IMO the discrepancies between typedef/using and explicit/auto make it far too unreliable to use with confidence. At minimum I'd need a static assert to check that the instruction generated when dereferencing the pointer is the unaligned one, which doesn't exist AFAIK.

In some cases I might want to have a MyFloatVector-reference to the memory area, which rules out using intrinsics.

Sample code: https://godbolt.org/z/caxScz. Includes ICC for "fun", which generates MOVUPS throughout.

  • `reinterpret_cast` is very often unsafe to use. What is wrong with [`_mm512_loadu_ps`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=vmovups&expand=3413,3413)? – chtz Jun 03 '20 at 17:46
  • It rules out using references, e.g. MyFloatVector& vec_ref = *vec_ptr; The objective is to implement a std::vector-like container that can operate on vectorized datatypes, hence e.g. []-operator requires the ability to form a reference to the internal storage. – user13673518 Jun 03 '20 at 18:35
  • Perhaps implement a custom reference object (which internally holds a pointer, but overloads `operator __m512()` and `operator=(__m512)`), similar to how references are handled in `vector`. – chtz Jun 03 '20 at 21:11
  • Yes--that seems to be the only way to go. I **was** kind of hoping to get to be lazy here and just rely on the compiler doing the magic :-) Thanks for your help though! – user13673518 Jun 04 '20 at 15:18
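
The proxy-reference idea from the comments above could be sketched roughly as follows. This is only a minimal illustration, not a tested container design: `FloatVec16`, `FloatVec16Ref`, `FloatStore`, and `block_at` are hypothetical names, the vector type uses the GNU C vector extension rather than `__m512`, and the reads and writes go through `memcpy` so that no alignment is assumed for the underlying floats.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>
#include <vector>

// Silence the ABI warning GCC/Clang may emit when a 64-byte vector is
// returned by value in a build without AVX-512 enabled.
#pragma GCC diagnostic ignored "-Wpsabi"

// GNU C vector extension type: 16 floats (hypothetical name).
typedef float FloatVec16 __attribute__((vector_size(64)));

// Hypothetical proxy that stands in for "FloatVec16&" on float-aligned storage.
class FloatVec16Ref {
public:
    explicit FloatVec16Ref(float* p) : p_(p) {}
    FloatVec16Ref(const FloatVec16Ref&) = default;

    // Read: copy 16 floats out; memcpy makes no alignment assumption.
    operator FloatVec16() const {
        FloatVec16 v;
        std::memcpy(&v, p_, sizeof(FloatVec16));
        return v;
    }

    // Write: copy 16 floats back into the underlying storage.
    FloatVec16Ref& operator=(const FloatVec16& v) {
        std::memcpy(p_, &v, sizeof(FloatVec16));
        return *this;
    }

    // Assigning one proxy to another copies the elements, not the pointer.
    FloatVec16Ref& operator=(const FloatVec16Ref& other) {
        return *this = static_cast<FloatVec16>(other);
    }

private:
    float* p_;
};

// Hypothetical container: plain float storage with vector-sized views.
class FloatStore {
public:
    explicit FloatStore(std::size_t count) : data_(count) {}

    float& scalar(std::size_t i) { return data_[i]; }

    // View of 16 floats starting at any float offset.
    FloatVec16Ref block_at(std::size_t offset) {
        assert(offset + 16 <= data_.size());
        return FloatVec16Ref(data_.data() + offset);
    }

private:
    std::vector<float> data_;
};
```

With this shape, `FloatVec16 v = store.block_at(5);` reads a block and `store.block_at(5) = v + v;` writes one back, which is about as close to reference semantics as a proxy gets. The usual proxy caveat applies: `auto r = store.block_at(5);` deduces the proxy type, not the vector type.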

1 Answer


When you use reinterpret_cast you're telling the compiler that the argument points to a valid object of the requested type. That means that it has the same alignment requirements.

ICC is being more conservative here, while clang and GCC are trying to make your code go faster by assuming that you're actually adhering to the standard.

Keep in mind that the aligned attribute can only be used to increase alignment requirements, not to decrease them, so in your code you're just saying that the types have a minimum alignment of 4 bytes. If you add a static_assert(alignof(MyFloatVector) == 4, "Alignment should be 4") you'll probably see some failures, depending on how exactly you declare it.
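
To see what alignment a given compiler actually assigns to each spelling, something like the following might help. It is only a sketch: the names `VecNatural`, `VecTypedef`, `VecUsing` and the `k...Align` constants are made up for illustration, and the values are deliberately not stated here because they can differ between compilers and declaration styles.

```cpp
#include <cstddef>

// The spellings discussed in this thread (GNU C vector extensions).
typedef float VecNatural __attribute__((vector_size(64)));
typedef float VecTypedef __attribute__((vector_size(64), aligned(4)));
using VecUsing = float __attribute__((vector_size(64), aligned(4)));

// Alignment the compiler reports for each type; compiler-dependent.
constexpr std::size_t kNaturalAlign = alignof(VecNatural);
constexpr std::size_t kTypedefAlign = alignof(VecTypedef);
constexpr std::size_t kUsingAlign   = alignof(VecUsing);

// Uncomment to turn a silently ignored aligned(4) into a compile error:
// static_assert(kTypedefAlign == alignof(float), "typedef is not under-aligned");
// static_assert(kUsingAlign == alignof(float), "alias is not under-aligned");
```

Note that such a check only covers the alignment the type reports. It does not by itself prove which load instruction is emitted at every use, so inspecting the generated assembly is still worthwhile.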

Since you're not using __m512, _mm512_loadu_ps would work but probably isn't really the right way to go IMHO. The correct way to load unaligned data is to use memcpy (or __builtin_memcpy, since you're using vector extensions anyway). Compilers are really good at optimizing memcpy with known sizes; as long as you're using a relatively recent compiler, you should end up with a vmovups on x86 with AVX-512F enabled.
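
A minimal sketch of the memcpy approach, assuming a GNU C vector type of 16 floats. The names `FloatVec16`, `load_unaligned`, `store_unaligned`, and `double_block` are hypothetical, and the helpers pass vectors by reference purely to keep the example free of ABI warnings in builds without AVX-512.

```cpp
#include <cassert>
#include <cstddef>
#include <cstring>

// GNU C vector extension type: 16 floats (hypothetical name).
typedef float FloatVec16 __attribute__((vector_size(64)));

// Copy 16 floats from a pointer that is only guaranteed float-aligned.
inline void load_unaligned(FloatVec16& out, const float* src) {
    std::memcpy(&out, src, sizeof(FloatVec16));
}

// Copy 16 floats back to a pointer that is only guaranteed float-aligned.
inline void store_unaligned(float* dst, const FloatVec16& v) {
    std::memcpy(dst, &v, sizeof(FloatVec16));
}

// Doubles the 16 floats starting at data[offset]; offset may be any index
// that leaves 16 elements in range, not just a multiple of 16.
inline void double_block(float* data, std::size_t size, std::size_t offset) {
    assert(offset + 16 <= size);
    FloatVec16 v;
    load_unaligned(v, data + offset);
    v = v + v;  // element-wise add via the vector extension
    store_unaligned(data + offset, v);
}
```

Built with optimization and AVX-512F enabled (for example `-O2 -mavx512f`), each fixed-size memcpy would be expected to compile to a single unaligned vector move, but that is worth confirming in the compiler output rather than assuming.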

nemequ
  • `vmovdqu` on an address that happens to be aligned is just as fast as `vmovdqa`. What GCC and clang are doing is helping you verify the alignment you promised, on the assumption that you did that on purpose and wanted to fault on misaligned. – Peter Cordes Jun 04 '20 at 02:52
  • Also, fun fact, GCC/clang headers implement `loadu` / `storeu` intrinsics with `typedef float __attribute__ ((vector_size(64), aligned(1), may_alias)) __m512_u;`. `__attribute__` really can let you tell GCC about under-aligned types (unlike ISO C `alignas`), and works for unaligned scalar types, too. See also [Is \`reinterpret\_cast\`ing between hardware SIMD vector pointer and the corresponding type an undefined behavior?](https://stackoverflow.com/q/52112605). But you should just use `_mm512_loadu_ps` instead of re-implementing it yourself or messing around with GNU C native vector stuff. – Peter Cordes Jun 04 '20 at 02:55
  • Thank you for your answers! I had read about `aligned()` being only able to increase alignment. Omitting it does, however, change the behavior for GCC and Clang here (both then generate an aligned load in all cases), so that left me confused as to whether the initial alignment was assumed to be from `float` or `float __attribute__ ((vector_size(64)))` in the example case. I had missed the `reinterpret_cast` assumption; it does make sense. It's still peculiar to me why `typedef` types behave differently from `using`, and likewise `auto` vs. an explicit type. – user13673518 Jun 04 '20 at 15:16