
The Intel intrinsics guide states simply that _mm512_load_epi32:

Load[s] 512-bits (composed of 16 packed 32-bit integers) from memory into dst

and that _mm512_load_si512:

Load[s] 512-bits of integer data from memory into dst

What is the difference between these two? The documentation isn't clear.

Peter Cordes
Qix - MONICA WAS MISTREATED
  • There's no difference, it's just silly redundant naming. All AVX512 instructions can be used with masking (e.g. `vmovdqa32` can do a masked load), but these intrinsics are for the no-masking version. Maybe a duplicate of [error: '\_mm512\_loadu\_epi64' was not declared in this scope](https://stackoverflow.com/q/53604986) – Peter Cordes Dec 23 '18 at 18:19
  • Thanks @PeterCordes, that makes sense (I've seen lots of discussions criticizing AVX512 for a number of similar reasons, too). Mind making an answer so I can mark as accepted? The linked Q has interesting information but isn't a dupe. – Qix - MONICA WAS MISTREATED Dec 23 '18 at 18:27
  • Already in the process of doing that now, after deciding that answering this would make more sense than editing my answer on the other Q to include some of the explanation in comments there. BTW, I hope you mean criticizing the intrinsic naming. AVX512 is very nice, it's just that Intel is bad at naming their intrinsics, and have been since they started naming shuffles `permute` vs. `shuf` and made AVX1 `vpermilps` vs. AVX2 `vpermps` super-confusing. – Peter Cordes Dec 23 '18 at 18:28
  • Yes, precisely - naming conventions. :) – Qix - MONICA WAS MISTREATED Dec 23 '18 at 18:42

1 Answer


There's no difference, it's just silly redundant naming. Use _mm512_load_si512 for clarity. Thanks, Intel. As usual, it's easier to understand the underlying asm for AVX512, and then you can see what the clumsy intrinsic naming is trying to say. Or at least you can understand how we ended up with this mess of different documentation suggesting _mm512_load_epi32 vs. _mm512_load_si512.

Almost all AVX512 instructions support merge-masking and zero-masking (e.g. `vmovdqa32` can do a masked load like `vmovdqa32 zmm0{k1}{z}, [rdi]`, zeroing vector elements where `k1` had a zero bit), which is why different element-size versions of things like vector loads and bitwise operations exist (e.g. `vpxord` vs. `vpxorq`).

But these intrinsics are for the no-masking version. The element-size is totally irrelevant. I'm guessing _mm512_load_epi32 exists for consistency with _mm512_mask_load_epi32 (merge-masking) and _mm512_maskz_load_epi32 (zero-masking). See the docs for the vmovdqa32 asm instruction.

e.g. _mm512_maskz_loadu_epi64(0x55, x) zeros the odd elements for free while loading. (At least it's free if the cost of putting 0x55 into a k register can be hoisted out of a loop, and if it doesn't defeat the compiler's ability to fold the load into a memory operand for an ALU instruction.)

When elements are all loaded into the destination unchanged, element boundaries are meaningless. That's why AVX2 and earlier don't have different element-size versions of bitwise booleans like _mm_xor_si128 and loads/stores like _mm_load_si128.


Some compilers don't support the element-width names for unaligned unmasked loads. e.g. current gcc doesn't support _mm512_loadu_epi64, even though it has supported _mm512_load_epi64 since the first gcc version to support AVX512 intrinsics at all. (See error: '_mm512_loadu_epi64' was not declared in this scope)

There are no CPUs where the choice of vmovdqa64 vs. vmovdqa32 matters at all for efficiency, so there's zero point in trying to hint the compiler to use one or the other, regardless of the natural element width of your data.

Only FP vs. integer might matter for loads, and Intel's intrinsics already use different types (__m512 vs. __m512i) for that.

Peter Cordes