"AVX has instructions for inserting and extracting 16 and 32 bit integers into __m256i vectors"

No it doesn't: the _mm256_insert_epi16 and _mm256_insert_epi32 intrinsics are "fake"; they have to be emulated with multiple instructions, the same way _mm_set_epi32(a,b,c,d) isn't an intrinsic for any single instruction.
IDK why Intel chose to provide them for AVX1/2 but not AVX512 versions; maybe they later realized they shouldn't have provided them for AVX2, to avoid fooling people into writing inefficient code if they were assuming those intrinsics only cost one shuffle. But they can't remove the existing ones without breaking existing code.
vpinsrd ymm_dst, ymm_src, r/m32, imm8 (or a ZMM version) unfortunately doesn't exist; only the xmm form does (https://www.felixcloutier.com/x86/pinsrb:pinsrd:pinsrq). The XMM version is unusable on a __m256i because it zeros the upper 128 bits. See Using ymm registers as a "memory-like" storage location. (You can insert into the low 128 bits of a YMM by using the legacy SSE encoding of pinsrd xmm, r/m32, imm8, but that's disastrously slow on Haswell and Ice Lake because of how SSE/AVX transition penalties work there. It's fine on Skylake or Ryzen, but compilers will never emit that anyway.)
_mm256_insert_epi32 might compile with AVX2 to a broadcast load and vpblendd to insert a dword from memory. Or worse, with an integer that was already in a register, a compiler might vmovd it to an xmm reg, broadcast that to YMM, then blend. (Like I showed doing by hand in Move an int64_t to the high quadwords of an AVX2 __m256i vector.) The "appropriate" implementation depends on surrounding code.
If you have more than 1 element to insert, you might want to shuffle them together before inserting. Or even consider vector store, multiple scalar stores, then vector reload, despite the store-forwarding stall. Or scalar stores / vector reload to feed a blend if the latency critical path goes through the vector, not the scalars. Probably worth it if you have lots of small scalar elements.
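In concrete terms, the store / scalar-stores / reload version might look like this (untested sketch with an arbitrary two-element signature; whether it's worth it depends on how many elements you have and whether the latency critical path goes through the vector):

#include <immintrin.h>
#include <stdint.h>
#include <stdalign.h>

// Bounce through memory: one vector store, a couple of cheap scalar stores,
// then a vector reload (which will suffer a store-forwarding stall).
static inline __m256i insert_two_via_memory(__m256i v, uint32_t a, int pos_a,
                                            uint32_t b, int pos_b)
{
    alignas(32) uint32_t tmp[8];
    _mm256_store_si256((__m256i*)tmp, v);
    tmp[pos_a] = a;
    tmp[pos_b] = b;
    return _mm256_load_si256((__m256i*)tmp);
}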
However, for a single insert AVX512F actually has some nice capabilities: it has 2-input shuffles like vpermt2d that you could use to insert an element from the bottom of one x/y/zmm into any position in another vector (taking all the rest of the destination elements from that other vector as a source).
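For example, something like this (untested; position 5 and the function name are arbitrary, and the index constant would ideally be hoisted out of a loop or loaded from memory):

#include <immintrin.h>

// Index values 0..15 select elements of the first source (target), 16..31
// select elements of the second (src).  The index at position 5 is 16, so
// the result is target with its element 5 replaced by element 0 of src.
static inline __m512i insert_low_dword_at_5(__m512i target, __m512i src)
{
    const __m512i idx = _mm512_set_epi32(15,14,13,12,11,10,9,8,
                                         7,6,16,4,3,2,1,0);
    return _mm512_permutex2var_epi32(target, idx, src);   // vpermt2d / vpermi2d
}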
But most useful here is masked broadcast: uops.info confirms that VPBROADCASTW zmm0{k1}, eax is a single-uop instruction, with 3 cycle latency from vector to vector (for the merge) and from mask to vector, and <= 5 cycle latency from eax to the merge result. The only problem is setting up the mask, but hopefully that can get hoisted out of a loop for an invariant insert-position.
#include <immintrin.h>
#include <stdint.h>
__m512i _mm512_insert32(__m512i target, uint32_t x, const int pos)
{
    // Merge-masked broadcast (vpbroadcastd zmm{k1}, r32): only the element
    // selected by the 1<<pos mask bit is replaced; the rest come from target.
    return _mm512_mask_set1_epi32(target, 1UL<<pos, x);
}
This compiles on Godbolt to this asm:
# gcc8.3 -O3 -march=skylake-avx512
_mm512_insert32(long long __vector(8), unsigned int, int):
mov eax, 1
shlx eax, eax, esi
kmovw k1, eax # mask = 1<<pos
vpbroadcastd zmm0{k1}, edi
ret
(gcc9 wastes an extra instruction copying ESI for no reason).
With a compile-time-constant pos you get code like mov eax,2 / kmovw k1, eax; a masked broadcast is probably still the best choice.
This works for 8, 16, 32, or 64-bit elements. The 8 and 16-bit versions of course require AVX512BW for the narrow vpbroadcastb/w broadcasts, while the 32 and 64-bit versions only require AVX512F.
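For example, a 16-bit version might look like this (same pattern, arbitrary name, assuming AVX512BW):

#include <immintrin.h>
#include <stdint.h>

// Merge-masked vpbroadcastw: the mask is a __mmask32, one bit per word element.
__m512i insert16(__m512i target, uint16_t x, const int pos)
{
    return _mm512_mask_set1_epi16(target, (__mmask32)(1U << pos), x);
}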
Extract:
Just shuffle the element you want to the bottom of the __m512i, where you can read it with _mm_cvtsi128_si32 (after _mm512_castsi512_si128). A useful shuffle is valignd, which shifts or rotates by whole dword elements, letting you efficiently get any element to the bottom of a vector without needing a vector shuffle-control. https://www.felixcloutier.com/x86/valignd:valignq
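For example, something like this (untested; the position is hard-coded to 4 because _mm512_alignr_epi32 needs a compile-time-constant count, and the function name is arbitrary):

#include <immintrin.h>
#include <stdint.h>

// Rotate so the wanted element (here element 4) lands at position 0,
// then read the low dword of the vector.
static inline uint32_t extract32_pos4(__m512i v)
{
    __m512i rot = _mm512_alignr_epi32(v, v, 4);   // valignd zmm, zmm, zmm, 4
    return (uint32_t)_mm_cvtsi128_si32(_mm512_castsi512_si128(rot));
}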