
Intel's intrinsic guide lists the intrinsic _mm256_loadu_epi32:

__m256i _mm256_loadu_epi32 (void const* mem_addr);
/*
   Instruction: vmovdqu32 ymm, m256
   CPUID Flags: AVX512VL + AVX512F
   Description
       Load 256-bits (composed of 8 packed 32-bit integers) from memory into dst.
       mem_addr does not need to be aligned on any particular boundary.
   Operation
   dst[255:0] := MEM[mem_addr+255:mem_addr]
   dst[MAX:256] := 0
*/

But clang and gcc do not provide this intrinsic. Instead they provide (in file avx512vlintrin.h) only the masked versions

__m256i _mm256_mask_loadu_epi32 (__m256i src, __mmask8 k, void const* mem_addr);
__m256i _mm256_maskz_loadu_epi32 (__mmask8 k, void const* mem_addr);

which boil down to the same instruction vmovdqu32. My question: how can I emulate _mm256_loadu_epi32:

 inline __m256i _mm256_loadu_epi32(void const* mem_addr)
 {
      /* code using vmovdqu32 and compiles with gcc */
 }

without writing assembly, i.e. using only available intrinsics?

Walter
  • Since you don't need masking (and therefore the element size is irrelevant) you can just use [`_mm256_loadu_si256`](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#text=_mm256_loadu_si256&expand=3418). – Paul R Jan 08 '20 at 16:38
  • @PaulR Is this better than `_mm256_maskz_loadu_epi32(0xffu,ptr)`? Would you promote this comment to an answer? – Walter Jan 08 '20 at 16:54
  • 2
    Yes, it's better. The compiler can always use an AVX512 encoding if it wants to load into ymm16..31, otherwise you want it to use a shorter `vmovdqu`. Related: [What is the difference between \_mm512\_load\_epi32 and \_mm512\_load\_si512?](//stackoverflow.com/q/53905757) – Peter Cordes Jan 08 '20 at 16:56
  • 2
    Note that with `_mm256_loadu_si256` you need to cast the input-pointer to `const __m256i*` (so not a bad idea, to encapsulate that into an inlined function) – chtz Jan 08 '20 at 17:00

1 Answer


Just use _mm256_loadu_si256 like a normal person. The only thing the AVX512 intrinsic gives you is a nicer prototype (const void* instead of const __m256i*) so you don't have to write ugly casts.

@chtz points out that you might still want to write a wrapper function yourself to get the void* prototype. But don't call it _mm256_loadu_epi32; some future GCC version will probably add that intrinsic for compatibility with Intel's docs and break your code.
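
For example, a minimal sketch of such a wrapper (my_loadu_si256 is just an illustrative placeholder name; pick anything that can't collide with a future compiler-provided intrinsic):

#include <immintrin.h>

/* Wrapper giving the nicer void* prototype of Intel's _mm256_loadu_epi32,
   implemented with the plain AVX1 unaligned-load intrinsic. */
static inline __m256i my_loadu_si256(void const* mem_addr)
{
    return _mm256_loadu_si256((const __m256i*)mem_addr);
}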

From another perspective, it's unfortunate that compilers don't treat _mm256_loadu_epi32 as an AVX1 intrinsic, but I guess compilers which don't optimize intrinsics, and which let you use intrinsics from ISA extensions you haven't enabled, need this kind of clue to know when they can use ymm16..31.


You don't even want the compiler to emit vmovdqu32 ymm when you're not masking; vmovdqu ymm is shorter and does exactly the same thing, with no penalty for mixing with EVEX-encoded instructions. The compiler can always use an EVEX vmovdqu32 or vmovdqu64 if it wants to load into ymm16..31; otherwise you want it to use the shorter VEX-coded AVX1 vmovdqu.

I'm pretty sure that GCC treats _mm256_maskz_loadu_epi32(0xffu, ptr) exactly the same as _mm256_loadu_si256((const __m256i*)ptr) and makes the same asm regardless of which one you use. It can optimize away the all-ones 0xffu mask and simply use an unmasked load, but there's no need for that extra complication in your source.
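
A small sketch of that equivalence (assuming ptr points to at least 32 readable bytes and you compile with AVX512VL enabled, e.g. -march=skylake-avx512; with optimization on, both loads should compile to a single unmasked 256-bit load):

#include <immintrin.h>

__m256i load_both_ways(void const* ptr)
{
    __m256i a = _mm256_maskz_loadu_epi32(0xffu, ptr);     /* all-ones mask, AVX512VL */
    __m256i b = _mm256_loadu_si256((const __m256i*)ptr);  /* plain AVX1 load */
    return _mm256_add_epi32(a, b);  /* use both so neither load is dead code */
}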

But unfortunately GCC9 and earlier pessimize to vmovdqu32 ymm0, [mem] when AVX512VL is enabled (e.g. -march=skylake-avx512) even when you write _mm256_loadu_si256. This was a missed optimization, GCC bug 89346.

It doesn't matter which 256-bit load intrinsic you use (except for aligned vs. unaligned) as long as there's no masking.
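
For reference, the aligned/unaligned pair (a sketch; the aligned version requires a 32-byte-aligned pointer):

#include <immintrin.h>

__m256i load_examples(const int* p, const int* q32 /* 32-byte aligned */)
{
    __m256i u = _mm256_loadu_si256((const __m256i*)p);   /* any alignment OK */
    __m256i a = _mm256_load_si256((const __m256i*)q32);  /* UB (typically faults) if misaligned */
    return _mm256_add_epi32(u, a);
}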

Related: [What is the difference between \_mm512\_load\_epi32 and \_mm512\_load\_si512?](//stackoverflow.com/q/53905757)

Peter Cordes
  • 2
    "normal person" is subjective. – S.S. Anne Jan 08 '20 at 17:24
  • @JL2210: That phrasing is a humorous way to indicate that `_mm256_loadu_si256` is the normal / standard way that you'll find in lots of code, or at least that's how I intend it. It also implies that there's no downside to doing it this way, since I'm recommending it. (And the rest of the answer explains in more detail the lack of downside). I'm also implying that once you understand that mixing AVX and AVX512 intrinsics and/or instructions isn't a problem it's also the obvious solution. I don't think it's very likely to come across as rude, but correct me if I'm wrong. – Peter Cordes Jan 08 '20 at 17:43
  • Okay, so I did `_mm256_loadu_si256((const __m256i*)(k))`, but then clang tells me: `warning: cast from 'const std::int32_t *' (aka 'const int *') to 'const __m256i *' increases required alignment from 4 to 32` – Walter Jan 08 '20 at 18:27
  • @Walter: that's weird, do you have a Godbolt MCVE link for that I can look at? Because that's exactly what you'd do in AVX1 / AVX2 code before these `void*` intrinsics were even available. Dereferencing `__m256i*` is like `load` not `loadu`, so yes it does increase the alignment requirement, but passing it to `loadu` doesn't do that. That warning is spurious as long as you never directly deref that pointer. – Peter Cordes Jan 08 '20 at 18:50
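
A sketch of the distinction in that last comment (load8 is a hypothetical helper; the cast is what clang's -Wcast-align warning flags, but _mm256_loadu_si256 never dereferences the casted pointer directly):

#include <immintrin.h>
#include <stdint.h>

static inline __m256i load8(const int32_t* k)
{
    /* Safe for any alignment of k: loadu does an unaligned load. */
    return _mm256_loadu_si256((const __m256i*)k);
    /* By contrast, *(const __m256i*)k would be an aligned load:
       undefined behavior if k isn't 32-byte aligned. */
}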