"AVX has instructions for inserting and extracting 16 and 32 bit integers into __m256i vectors"

No it doesn't: the _mm256_insert_epi16 and _mm256_insert_epi32 intrinsics are "fake"; they have to be emulated with multiple instructions, the same way _mm_set_epi32(a,b,c,d) isn't an intrinsic for any single instruction.
IDK why Intel chose to provide them for AVX1/2 but not AVX512 versions; maybe they later realized they shouldn't have provided them for AVX2, to avoid fooling people into writing inefficient code if they were assuming those intrinsics only cost one shuffle. But they can't remove the existing ones without breaking existing code.
vpinsrd ymm_dst, ymm_src, r/m32, imm8 (or a ZMM version) unfortunately doesn't exist; only the xmm form does (https://www.felixcloutier.com/x86/pinsrb:pinsrd:pinsrq). The XMM version is unusable on a __m256i because it zeros the upper 128 bits. See Using ymm registers as a "memory-like" storage location. (You can insert into the low 128 bits of a YMM by using the legacy SSE encoding of pinsrd xmm, r/m32, imm8, but that's disastrously slow on Haswell and Ice Lake because of how SSE/AVX transition penalties work there. It's fine on Skylake or Ryzen, but compilers will never emit that anyway.)
_mm256_insert_epi32 might compile with AVX2 to a broadcast load and vpblendd to insert a dword from memory. Or worse, with an integer that was already in a register, a compiler might vmovd it to an xmm reg, broadcast that to YMM, then blend. (Like I showed doing by hand in Move an int64_t to the high quadwords of an AVX2 __m256i vector.) The "appropriate" implementation depends on surrounding code.
If you have more than 1 element to insert, you might want to shuffle them together before inserting. Or even consider vector store, multiple scalar stores, then vector reload, despite the store-forwarding stall. Or scalar stores / vector reload to feed a blend if the latency critical path goes through the vector, not the scalars. Probably worth it if you have lots of small scalar elements.
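In concrete terms, the store / scalar-stores / reload version might look like this (untested sketch with an arbitrary two-element signature; whether it's worth it depends on how many elements you have and whether the latency critical path goes through the vector):

#include <immintrin.h>
#include <stdint.h>
#include <stdalign.h>

// Bounce through memory: one vector store, a couple of cheap scalar stores,
// then a vector reload (which will suffer a store-forwarding stall).
static inline __m256i insert_two_via_memory(__m256i v, uint32_t a, int pos_a,
                                            uint32_t b, int pos_b)
{
    alignas(32) uint32_t tmp[8];
    _mm256_store_si256((__m256i*)tmp, v);
    tmp[pos_a] = a;
    tmp[pos_b] = b;
    return _mm256_load_si256((__m256i*)tmp);
}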
However, for a single insert AVX512F actually has some nice capabilities: it has 2-input shuffles like vpermt2d that you could use to insert an element from the bottom of one x/y/zmm into any position in another vector (taking all the rest of the destination elements from that other vector as a source).
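For example, something like this (untested; position 5 and the function name are arbitrary, and the index constant would ideally be hoisted out of a loop or loaded from memory):

#include <immintrin.h>

// Index values 0..15 select elements of the first source (target), 16..31
// select elements of the second (src).  The index at position 5 is 16, so
// the result is target with its element 5 replaced by element 0 of src.
static inline __m512i insert_low_dword_at_5(__m512i target, __m512i src)
{
    const __m512i idx = _mm512_set_epi32(15,14,13,12,11,10,9,8,
                                         7,6,16,4,3,2,1,0);
    return _mm512_permutex2var_epi32(target, idx, src);   // vpermt2d / vpermi2d
}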
But most useful here is masked broadcast: uops.info confirms that VPBROADCASTW zmm0{k1}, eax is a single-uop instruction, with 3 cycle latency from vector to vector (for the merge) and from mask to vector, and <= 5 cycle latency from eax to the merge result. The only problem is setting up the mask, but hopefully that can get hoisted out of a loop for an invariant insert-position.
#include <immintrin.h>
#include <stdint.h>
__m512i _mm512_insert32(__m512i target, uint32_t x, const int pos)
{
    // Merge-masked broadcast (vpbroadcastd zmm{k1}, r32): only the element
    // selected by the 1<<pos mask bit is replaced; the rest come from target.
    return _mm512_mask_set1_epi32(target, 1UL<<pos, x);
}
This compiles on Godbolt to this asm:
# gcc8.3 -O3 -march=skylake-avx512
_mm512_insert32(long long __vector(8), unsigned int, int):
mov eax, 1
shlx eax, eax, esi
kmovw k1, eax # mask = 1<<pos
vpbroadcastd zmm0{k1}, edi
ret
(gcc9 wastes an extra instruction copying ESI for no reason).
With a compile-time-constant pos you get code like mov eax,2 / kmovw k1, eax; a masked broadcast is probably still the best choice.
This works for 8, 16, 32, or 64-bit elements. The 8 and 16-bit versions of course require AVX512BW for the narrow vpbroadcastb/w broadcasts, while the 32 and 64-bit versions only require AVX512F.
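For example, a 16-bit version might look like this (same pattern, arbitrary name, assuming AVX512BW):

#include <immintrin.h>
#include <stdint.h>

// Merge-masked vpbroadcastw: the mask is a __mmask32, one bit per word element.
__m512i insert16(__m512i target, uint16_t x, const int pos)
{
    return _mm512_mask_set1_epi16(target, (__mmask32)(1U << pos), x);
}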
Extract:
Just shuffle the element you want to the bottom of the __m512i, where you can read it with _mm_cvtsi128_si32 (after _mm512_castsi512_si128). A useful shuffle is valignd, which shifts or rotates by whole dword elements, letting you efficiently get any element to the bottom of a vector without needing a vector shuffle-control. https://www.felixcloutier.com/x86/valignd:valignq
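For example, something like this (untested; the position is hard-coded to 4 because _mm512_alignr_epi32 needs a compile-time-constant count, and the function name is arbitrary):

#include <immintrin.h>
#include <stdint.h>

// Rotate so the wanted element (here element 4) lands at position 0,
// then read the low dword of the vector.
static inline uint32_t extract32_pos4(__m512i v)
{
    __m512i rot = _mm512_alignr_epi32(v, v, 4);   // valignd zmm, zmm, zmm, 4
    return (uint32_t)_mm_cvtsi128_si32(_mm512_castsi512_si128(rot));
}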