
There is `movdqu`, available via `_mm_loadu_si128`, which requires SSE2.

There is `vmovdqu8` (also 16, 32, 64), available via `_mm_loadu_epi8` (16, 32, 64), which requires AVX512BW + AVX512VL (for the 8/16 forms) or AVX512F + AVX512VL (for the 32/64 forms).

What is the purpose of the latter if they apparently do the same thing?

If the purpose is masking, then why are the unmasked `_mm_loadu_epi8` (etc.) exposed as intrinsics at all?
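For reference, a minimal sketch (an illustration, not part of the original question; intrinsic names per the Intel Intrinsics Guide) of how the three forms look in source. The masked variant is the case the `vmovdqu8` encoding actually exists for:

```c
#include <immintrin.h>
#include <stdint.h>

// SSE2: the older intrinsic; the pointer must be cast to __m128i const*.
__m128i load_sse2(const uint8_t *p) {
    return _mm_loadu_si128((const __m128i *)p);   // movdqu / vmovdqu
}

// AVX512BW + AVX512VL: the same unmasked 128-bit load, but takes void const*.
__m128i load_avx512(const uint8_t *p) {
    return _mm_loadu_epi8(p);                     // vmovdqu8 (or plain vmovdqu)
}

// The masked form is where vmovdqu8 differs: load only the bytes selected
// by k, zeroing the rest (there is also a merging _mm_mask_loadu_epi8).
__m128i load_first_n(const uint8_t *p, unsigned n) {
    __mmask16 k = (__mmask16)((1u << n) - 1);     // assumes 1 <= n <= 16
    return _mm_maskz_loadu_epi8(k, p);            // vmovdqu8 xmm{k}{z}, [p]
}
```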

Alex Guteniev
  • The asm instructions exist to support masked loads/stores. The non-masking versions of the intrinsics are useless except for orthogonality of the function names; not all compilers even provide them (*[error: '_mm512_loadu_epi64' was not declared in this scope](https://stackoverflow.com/q/53604986)*). Or maybe to communicate to compilers like MSVC that they can use AVX-512VL machine code in this code path, allowing X/YMM16-31. Also they take a `void*`, unlike the badly designed older intrinsics that require casting your pointer to `(const __m128i*)`. – Peter Cordes Aug 14 '23 at 20:25
  • re @PeterCordes "maybe to communicate to compilers like MSVC" -- nope, it doesn't work this way. For the Intel compiler there's `_allow_cpu_features`. I'm not aware of a way to mix different levels of assumed instruction set with MSVC, except compiling different TUs with different `/arch` options: https://godbolt.org/z/9aaha1h8r – Alex Guteniev Aug 15 '23 at 06:04
  • /sigh, I've always thought MSVC's model of not having to enable instruction sets before using them was weird; disappointing that they're not even making the most of it. And it's probably hard for the optimizer to do a good job while not introducing newer instructions into code paths that didn't have them in the source. It's disappointing but not too surprising that MSVC doesn't even use `pshufb` for `set1_epi8` in the same basic block where it used `vmovdqu`, let alone AVX-512BW+VL `vpbroadcastb xmm0, dl`. – Peter Cordes Aug 15 '23 at 06:09
  • I was wondering why ICC did such a bad job; you only let it use AVX512F, not AVX512BW. https://godbolt.org/z/WaovT8vG3 shows it will use `vpbroadcastb xmm1, esi` (instead of vmovd + vpbroadcastb), but it still fails to fold the load into a memory source operand for `vpxor`, even with `-march=icelake-client`, so `_mm_loadu_epi8` definitely isn't a fancy instruction outside the baseline it's targeting. I also included GCC compiling it to the expected 2 instructions, not taking `_mm_loadu_epi8` as a request for a literal `vmovdqu8` like ICC does. – Peter Cordes Aug 15 '23 at 06:17
  • @PeterCordes MSVC has all the ways to generate `set1_epi8`; they're just only available via an `/arch` switch: https://godbolt.org/z/zPPGe5bMb – Alex Guteniev Aug 15 '23 at 06:25
  • @AlexGuteniev: Right. My point was that if you're going to allow using intrinsics for `/arch` settings you didn't enable, it's not actually usable to make efficient SSSE3 or SSE4.1 code paths that involve broadcasts and stuff. Well, I guess `_mm_set1_epi8` and `epi16` might be the only common intrinsics that should expand differently depending on ISA feature level until we get to AVX-512, and a lot of code won't use them at all with runtime variables. So I guess if people want efficient set1_epi8 with MSVC without enabling `/arch:AVX`, they have to manually `_mm_cvtsi32_si128` (movd) etc. (see the sketch after these comments). – Peter Cordes Aug 15 '23 at 06:38
  • @PeterCordes, added a Developer Community issue: https://developercommunity.visualstudio.com/t/Allow-using-newer-vector-instruction-on-/10440707. Not sure whether there will be any reaction. I'm not sure detection by instruction-set presence is a good thing, but I would certainly like an explicit way of marking code paths. Not for `*_set1_*`, rather for auto-vectorization of scalar loops or `memcpy`s. – Alex Guteniev Aug 15 '23 at 06:47
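Related to the last few comments, a minimal sketch (an illustration, assuming only SSE2 + SSSE3 are enabled) of the manual broadcast mentioned above, so the compiler isn't left to expand `_mm_set1_epi8` on its own:

```c
#include <tmmintrin.h>   // SSSE3: _mm_shuffle_epi8 (pulls in the SSE2 headers)

// movd the byte into a vector register, then pshufb with an all-zero
// control vector to replicate byte 0 into all 16 lanes.
static __m128i set1_epi8_ssse3(char c) {
    __m128i v = _mm_cvtsi32_si128((unsigned char)c);   // movd xmm, r32
    return _mm_shuffle_epi8(v, _mm_setzero_si128());   // pshufb xmm, zeroed xmm
}
```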

0 Answers