
I just noticed the absence of _mm256_insert_pd()/_mm256_insert_ps()/_mm_insert_pd(); _mm_insert_ps() does exist, but with a somewhat odd usage pattern.

Meanwhile, _mm_insert_epi32(), _mm256_insert_epi32(), and the other integer variants do exist.

Did Intel intentionally leave out the float/double variants for some reason? And what is the correct, most performant way to set a single float/double at a given position (not just position 0) of an SSE/AVX register?

I implemented the following AVX double variant of insert, which works, but maybe there is a better way to do this:

Try it online!

#include <immintrin.h>
#include <cstdint>
#include <cstring>

// Insert double x at position I of a __m256d by going through the integer domain.
template <int I>
__m256d _mm256_insert_pd(__m256d a, double x) {
    int64_t ix;
    std::memcpy(&ix, &x, sizeof(x));   // bit-cast the double to int64_t
    return _mm256_castsi256_pd(
        _mm256_insert_epi64(_mm256_castpd_si256(a), ix, I)
    );
}

As far as I can see, the extract float/double variants are also absent in SSE/AVX for some reason. I know only _mm_extract_ps() exists, but not the others.

Do you know why insert and extract are absent for float/double in SSE/AVX?

Arty

1 Answer


A scalar float/double is just the bottom element of an XMM/YMM register already, and there are various FP shuffle instructions including vinsertps and vmovlhps that can (in asm) do the insertion of a 32-bit or 64-bit element. There aren't versions of those which work on 256-bit YMM registers, though, and general 2-register shuffles aren't available until AVX-512, and only with a vector control.
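
For the 128-bit case, the "weird" _mm_insert_ps usage pattern can already do this insert; here's a minimal sketch (the wrapper name is made up, and the _mm_set_ss may cost an extra instruction zeroing upper elements, which is the intrinsics-API problem discussed below and in footnote 1):

// imm8 for insertps: bits [7:6] select the source element of the second operand,
// bits [5:4] the destination element, bits [3:0] a zero mask.
// The source here is element 0 of _mm_set_ss(x), so only the destination field is set.
template <int pos>
__m128 insert_float128(__m128 v, float x) {
    return _mm_insert_ps(v, _mm_set_ss(x), pos << 4);
}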

Still, much of the difficulty is in the intrinsics API, which makes it harder to get at the useful asm operations.


One not-bad way is to broadcast a scalar float or double and blend, partly because a broadcast is one of the ways that intrinsics already provide for getting a __m256d that contains your scalar¹.

Immediate-blend instructions can efficiently replace one element of another vector, even in the high half². They have good throughput, latency, and back-end port distribution on most AVX CPUs. They require SSE4.1, but with AVX they're always available.

(See also Agner Fog's VectorClass Library (VCL) for C++ templates that replace an element of a vector, for various SSE / AVX feature levels. It even supports runtime-variable indices, but is often designed to optimize down to something good for compile-time constants, e.g. a switch on the index like in Vec4f::insert().)
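
As a rough illustration of that switch-on-index idea (a minimal sketch loosely modeled on VCL's pattern, not its actual code; it assumes AVX and reuses the broadcast + blend approach described above):

// Runtime-index insert: dispatch to immediate blends, because the blend
// control has to be a compile-time constant.
__m256 insert_float_runtime(__m256 v, float x, int pos) {
    __m256 xv = _mm256_set1_ps(x);               // broadcast x into all elements
    switch (pos & 7) {
        case 0:  return _mm256_blend_ps(v, xv, 1<<0);
        case 1:  return _mm256_blend_ps(v, xv, 1<<1);
        case 2:  return _mm256_blend_ps(v, xv, 1<<2);
        case 3:  return _mm256_blend_ps(v, xv, 1<<3);
        case 4:  return _mm256_blend_ps(v, xv, 1<<4);
        case 5:  return _mm256_blend_ps(v, xv, 1<<5);
        case 6:  return _mm256_blend_ps(v, xv, 1<<6);
        default: return _mm256_blend_ps(v, xv, 1<<7);
    }
}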


float into __m256

template <int pos>
__m256 insert_float(__m256 v, float x) {
    __m256 xv = _mm256_set1_ps(x);           // broadcast x into all 8 elements
    return _mm256_blend_ps(v, xv, 1<<pos);   // take element pos from xv, the rest from v
}

The best case is with position=0. (Godbolt)

auto test2_merge_0(__m256 v, float x){
    return insert_float<0>(v,x);
}

clang notices that the broadcast is redundant and optimizes it away:

test2_merge_0(float __vector(8), float):
        vblendps        ymm0, ymm0, ymm1, 1             # ymm0 = ymm1[0],ymm0[1,2,3,4,5,6,7]
        ret

But clang sometimes gets too clever for its own good, and pessimizes the position-5 case to

test2_merge_5(float __vector(8), float):  # clang(trunk) -O3 -march=skylake
        vextractf128    xmm2, ymm0, 1
        vinsertps       xmm1, xmm2, xmm1, 16    # xmm1 = xmm2[0],xmm1[0],xmm2[2,3]
        vinsertf128     ymm0, ymm0, xmm1, 1
        ret

Or when merging into a zeroed vector, clang uses vxorps-zeroing and then a blend, but GCC does better:

test2_zero_0(float):           # GCC(trunk) -O3 -march=skylake
        vinsertps       xmm0, xmm0, xmm0, 0xe
        ret
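
The same broadcast + blend pattern handles the __m256d case from the question; here's a minimal sketch assuming AVX (the _mm256_blend_pd control is a 4-bit mask, one bit per double):

template <int pos>
__m256d insert_double(__m256d v, double x) {
    __m256d xv = _mm256_set1_pd(x);          // broadcast x into all 4 elements
    return _mm256_blend_pd(v, xv, 1<<pos);   // take element pos from xv, the rest from v
}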

Footnote 1:
Which is a problem for intrinsics; many intrinsics that you could use with a scalar float/double are only available with vector operands, and compilers don't always manage to optimize away _mm_set_ss or _mm_set1_ps or whatever when you only actually read the bottom element. A scalar float/double is either in memory or the bottom element of an X/YMM register already, so in asm it's 100% free to use vector shuffles on scalar floats / doubles that are already loaded into a register.

But there's no intrinsic to tell the compiler you want a vector with don't-care elements outside the bottom. This means you have to write your source in a way that looks like it's doing extra work, and rely on the compiler to optimize it away. See How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?

Footnote 2:
Unlike vpinsrq. As you can see on Godbolt, your version compiles very inefficiently, especially with GCC. The compilers have to handle the high half of the __m256d separately; GCC finds far fewer optimizations and makes asm that's closer to your very inefficient source. BTW, make the function return a __m256d instead of assigning to a volatile; that way you have less noise (https://godbolt.org/z/Wrn7n4soh).

_mm256_insert_epi64 is a "compound" intrinsic / helper function: vpinsrq is only available in vpinsrq xmm, xmm, r/m64, imm8 form, which zero-extends the xmm register into the full Y/ZMM. Even clang's shuffle optimizer (which finds vmovlhps to replace the high half of an XMM with the low half of another XMM) still ends up extracting and re-inserting the high half when you blend into an existing vector instead of zero.


The asm situation is that the scalar operand for extractps is r/m32, not an XMM register, so it's not useful for extracting a scalar float (except to store it to memory). See my answer on the Q&A Intel SSE: Why does `_mm_extract_ps` return `int` instead of `float`? for more about it and insertps.

insertps xmm, xmm/m32, imm can select a source float from another vector register, so the only intrinsic takes two vectors. That leaves you with the problem from How to merge a scalar into a vector without the compiler wasting an instruction zeroing upper elements? Design limitation in Intel's intrinsics?: convincing the compiler not to waste instructions setting elements in a __m128 when you only care about the bottom one.
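
For the extract direction, one way (a sketch assuming AVX and C++17; the function name is made up) is to shuffle the wanted element down to the bottom and then use the free lane-0 read:

// Extract element I of a __m256: pull the containing 128-bit lane to the bottom,
// shuffle within it if needed, then read lane 0, which costs no instructions.
template <int I>
float extract_float(__m256 v) {
    __m128 lo;
    if constexpr (I < 4)
        lo = _mm256_castps256_ps128(v);      // low lane: free, no instruction
    else
        lo = _mm256_extractf128_ps(v, 1);    // high lane: one vextractf128
    if constexpr ((I & 3) != 0)
        lo = _mm_shuffle_ps(lo, lo, _MM_SHUFFLE(3, 2, 1, I & 3));  // wanted element to lane 0
    return _mm_cvtss_f32(lo);
}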

Peter Cordes
  • Can you please tell me, then, what's the most-performant way to `extract` a single element (not at the 0th position)? Is it through doing a shuffle/permute and getting the lower element (0th position)? – Arty Jun 10 '21 at 08:55
  • @Arty: See also the last edit, including https://github.com/vectorclass/version2/blob/master/vectorf128.h#L603 for examples of pre-SSE4.1 ways if you need legacy SSE. – Peter Cordes Jun 10 '21 at 08:56
  • @Arty: Oh, yes, shuffle to the bottom for `_mm_cvtss_f32` (which is free). One way to shuffle to the bottom is (ironically) `insertps`. You can also use Agner Fog's VectorClass library, which has insert/extract functions and overloaded `operator[]`. But I'm not sure they're optimal for compile-time-constant indices; clang is pretty good about optimizing shuffles, but GCC might not be. Still, store to array / scalar access to array is something compilers are pretty good at optimizing into a shuffle, so it's actually ok. – Peter Cordes Jun 10 '21 at 08:59
  • Do you know if there is any special reason why Intel didn't implement `insert`/`extract` functions for float/double? They seem like very useful and common things, to set or get just one single element of an array or vector. Also I wonder why these functions are quite slow for a non-0th element; is there anything difficult for the CPU in just reading out part of a register? – Arty Jun 10 '21 at 10:26
  • @Arty: As I said, there already are FP shuffles for XMM. `shufps xmm0,xmm0, n` works just fine. So does `movhlps xmm1, xmm0` to merge the high half into another register, or `unpckhpd xmm0,xmm0`. Or to store to memory, `extractps [rdi], xmm0, 2` works just fine (and smart compilers will sometimes optimize `_mm_shuffle_ps` + `*ptr = _mm_cvtss_f32(v)` into that; I probably gave an example of that on one of the earlier linked Q&As about insert/extract). And `insertps` is a fully flexible XMM insert/extract instruction that can copy 1 element from anywhere to anywhere. – Peter Cordes Jun 10 '21 at 10:55
  • @Arty: As for why not AVX YMM instructions: AVX1 only provides lane-crossing shuffles with 128-bit granularity, to keep the shuffle hardware simple. Even AVX2 [omits `vpermb`](https://stackoverflow.com/questions/37980209/where-is-vpermb-in-avx2), only having 2 in-lane 128-bit shuffles for `vpshufb ymm`. **If your code spends a lot of time getting elements into / out of vectors 1 at a time (especially for vectors as wide as YMM), it's not using SIMD efficiently**. Providing better instructions to speed that up won't make a big difference. – Peter Cordes Jun 10 '21 at 10:58
  • @Arty: *is there anything difficult for CPU just to read-out part of register* - somewhat. Every possible bit that could be a source for a given result bit needs gates connecting it to that output signal. Those gates need to control which of the possible inputs will actually make it to that output. The more possible sources there are, the more logic it takes to build. AVX-512 does have a lot of powerful shuffles, though: [Using ymm registers as a "memory-like" storage location](https://stackoverflow.com/q/50915381) shows how masked broadcasts from integer regs are useful. – Peter Cordes Jun 10 '21 at 11:02
  • @Arty From the ISA standpoint, there is no reason to have separate shuffle/insert/extract/blend/logic instructions for INT and FP, as long as the necessary element sizes are supported. Those kinds of instructions treat data as bags of bits, so applying e.g. integer shuffle to a vector register that contains FP elements has the same effect as a FP shuffle. The reason why these instructions are separate for INT and FP stems from hardware, which implemented separate execution units for INT and FP, including shufflers. – Andrey Semashev Jun 10 '21 at 16:44
  • Passing data between those domains incurred a performance penalty, so it made sense to have seemingly equivalent instructions specialized for INT and FP. In recent Intel architectures, I believe, the distinction between these INT and FP instructions is largely removed. And in AMD architectures there was probably none from the start. So there is little reason to keep duplicating instructions like that in future ISA extensions. – Andrey Semashev Jun 10 '21 at 16:50
  • @AndreySemashev: There are still separate FP vs. SIMD-int forwarding networks. Shuffle execution units specifically are expensive and often connected to both, instead of replicating like for blends. AMD also has bypass delay between SIMD-int vs. FP, see Agner Fog's microarch guide, https://agner.org/optimize/. New extensions like AVX2 and AVX-512 still have separate opcodes for `vpermpd` vs. `vpermq`, those mnemonics aren't just synonyms for the same opcode. (And AVX-512 `vmovdqa64` vs. `vmovapd`.) – Peter Cordes Jun 10 '21 at 16:56
  • @PeterCordes Given that the shuffler is shared, I would assume there is no penalty of mismatching shuffle instructions domain with adjacent arithmetic instructions domain. Is that not the case? – Andrey Semashev Jun 10 '21 at 18:50
  • @AndreySemashev: Yes that's correct for shuffle instructions on modern CPUs. (I meant that the concept isn't entirely gone). On some older CPUs, e.g. Conroe or K10, the shuffle unit was in one domain or the other, and some FP shuffles counted as integer for forwarding purposes, IIRC. (And maybe some the other way around, I forget, but Agner's instruction tables list the domain for those CPUs.) – Peter Cordes Jun 10 '21 at 23:00
  • @PeterCordes Can you also please put a look at my [this question](https://stackoverflow.com/questions/67952067/)? – Arty Jun 12 '21 at 19:00
  • @Arty: I already saw it in my question feed, but thanks for the heads up. – Peter Cordes Jun 12 '21 at 19:03