I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something.
Using Intel intrinsics, if you want to combine a scalar float with an existing vector, there doesn't seem to be a way to do it without zeroing the high elements or broadcasting the scalar into a full vector. I haven't investigated GNU C native vector extensions and the associated builtins.
This wouldn't be too bad if the extra intrinsic optimized away, but it doesn't with gcc (5.4 or 6.2). There's also no nice way to use pmovzx or insertps as loads, for the related reason that their intrinsics only take vector args. (And gcc doesn't fold a scalar->vector load into the asm instruction.)
__m128 replace_lower_two_elements(__m128 v, float x) {
__m128 xv = _mm_set_ss(x); // WANTED: something else for this step, some compilers actually compile this to a separate insn
return _mm_shuffle_ps(v, xv, 0); // lower 2 elements are both x, and the garbage is gone
}
gcc 5.3 -march=nehalem -O3 output, to enable SSE4.1 and tune for that Intel CPU: (It's even worse without SSE4.1; multiple instructions to zero the upper elements).
insertps xmm1, xmm1, 0xe # pointless zeroing of upper elements. shufps only reads the low element of xmm1
shufps xmm0, xmm1, 0 # The function *should* just compile to this.
ret
TL;DR: the rest of this question is just asking if you can actually do this efficiently, and if not why not.
clang's shuffle-optimizer gets this right, and doesn't waste instructions on zeroing high elements (_mm_set_ss(x)) or duplicating the scalar into them (_mm_set1_ps(x)). Instead of writing something the compiler has to optimize away, shouldn't there be a way to write it "efficiently" in C in the first place? Even very recent gcc doesn't optimize it away, so this is a real (but minor) problem.
This would be possible if there were a scalar->128b equivalent of __m256 _mm256_castps128_ps256 (__m128 a), i.e. something that produces a __m128 with undefined garbage in the upper elements and the float in the low element, compiling to zero asm instructions if the scalar float/double was already in an xmm register.
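To make the wished-for semantics concrete: with such a cast (here called _mm_castss_ps, a hypothetical name that doesn't exist in any immintrin.h), the motivating function could be written so that only the shufps shows up in the asm:

__m128 replace_lower_two_elements(__m128 v, float x) {
    __m128 xv = _mm_castss_ps(x);    // hypothetical: zero asm instructions, upper elements undefined
    return _mm_shuffle_ps(v, xv, 0); // shufps xmm0, xmm1, 0
}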
None of the following intrinsics exist, but they should:

- A scalar->__m128 equivalent of _mm256_castps128_ps256, as described above: the most general solution for the scalar-already-in-register case.

- __m128 _mm_move_ss_scalar (__m128 a, float s): replace the low element of vector a with scalar s. This isn't actually necessary if there's a general-purpose scalar->__m128 (previous bullet point). (The reg-reg form of movss merges, unlike the load form, which zeros, and unlike movd, which zeros the upper elements in both cases. To copy a register holding a scalar float without false dependencies, use movaps.)

- __m128i _mm_loadzxbd (const uint8_t *four_bytes), and other sizes of PMOVZX / PMOVSX: AFAICT, there's no good safe way to use the PMOVZX intrinsics as a load, because the inconvenient safe way doesn't optimize away with gcc (see the sketch after this list).

- __m128 _mm_insertload_ps (__m128 a, float *s, const int imm8): INSERTPS behaves differently as a load: the upper 2 bits of the imm8 are ignored, and it always takes the scalar at the effective address (instead of an element from a vector in memory). This lets it work with addresses that aren't 16B-aligned, and avoid faulting even if the float is right before an unmapped page. Like with PMOVZX, gcc fails to fold an upper-element-zeroing _mm_load_ss() into a memory operand for INSERTPS. (Note that if the upper 2 bits of the imm8 aren't both zero, then _mm_insert_ps(xmm0, _mm_load_ss(), imm8) can compile to insertps xmm0,xmm0,foo with a different imm8 that zeros elements in the vector as-if the src element were actually a zero produced by MOVSS from memory. clang actually uses XORPS/BLENDPS in that case.)
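For reference, here's a minimal sketch of the "inconvenient safe way" for both cases (the function names are my own, not real intrinsics). These are correct at any optimization level, but with current gcc the scalar load stays a separate movd/movss instruction instead of folding into a memory operand:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Safe PMOVZX-as-load: a 4-byte scalar load, then reg-reg pmovzxbd.
// memcpy avoids strict-aliasing UB and compiles to a single movd.
static inline __m128i pmovzxbd_load(const uint8_t *four_bytes) {
    int32_t tmp;
    memcpy(&tmp, four_bytes, 4);
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp)); // wanted: pmovzxbd xmm, [mem]
}

// Safe INSERTPS-as-load: upper-element-zeroing scalar load, then reg-reg insertps.
static inline __m128 insertps_load_ps(__m128 a, const float *s) {
    return _mm_insert_ps(a, _mm_load_ss(s), 0x10);    // wanted: insertps xmm, [mem], 0x10
}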
Are there any viable workarounds to emulate any of those that are both safe (don't break at -O0 by e.g. loading 16B that might touch the next page and segfault) and efficient (no wasted instructions at -O3 with current gcc and clang at least, preferably also other major compilers)? Preferably also in a readable way, but if necessary it could be put behind an inline wrapper function like __m128 float_to_vec(float a){ return something(a); }.
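One candidate (my own sketch, not a known-good answer): in GNU C, an empty asm statement with a matching "x" constraint can reinterpret a float that's already in an xmm register as a __m128 with don't-care upper elements, at zero instruction cost. This assumes gcc/clang accept the float/__m128 type mismatch across the matching constraint, and it defeats constant-propagation through the wrapper:

#include <immintrin.h>

static inline __m128 float_to_vec(float a) {
#ifdef __GNUC__
    __m128 v;
    asm("" : "=x"(v) : "0"(a));   // same xmm register in and out; no instruction emitted
    return v;
#else
    return _mm_set_ss(a);         // fallback: zeroes the upper elements
#endif
}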
Is there any good reason for Intel not to introduce intrinsics like that? They could have added a float->__m128 with undefined upper elements at the same time as adding _mm256_castps128_ps256. Is this a matter of compiler internals making it hard to implement? Perhaps specifically ICC internals?
The major calling conventions on x86-64 (SysV, or MS __vectorcall) take the first FP arg in xmm0 and return scalar FP values in xmm0, with the upper elements undefined. (See the x86 tag wiki for ABI docs.) This means it's not uncommon for the compiler to have a scalar float/double in a register with unknown upper elements. It will be rare in a vectorized inner loop, so I think avoiding these useless instructions will mostly just save a bit of code size.
The pmovzx case is more serious: that is something you might use in an inner loop (e.g. for a LUT of VPERMD shuffle masks, saving a factor of 4 in cache footprint vs. storing each index padded to 32 bits in memory).
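As a concrete illustration of that use-case (my own sketch; load_vpermd_mask and the LUT layout are assumptions, not from any particular codebase): store each 8-lane vpermd mask as 8 packed bytes instead of 8 dwords, and widen at load time:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// lut holds 8-byte entries: 8 packed byte indices per vpermd mask,
// 1/4 the cache footprint of storing 8 dword indices per entry.
static inline __m256i load_vpermd_mask(const uint8_t *lut, unsigned idx) {
    int64_t tmp;
    memcpy(&tmp, lut + 8 * idx, 8);                      // safe 8-byte load (movq)
    return _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(tmp)); // wanted: vpmovzxbd ymm, [mem]
}

// usage: __m256i shuffled = _mm256_permutevar8x32_epi32(v, load_vpermd_mask(lut, i));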
The pmovzx-as-a-load issue has been bothering me for a while now, and the original version of this question got me thinking about the related issue of using a scalar float in an xmm register. There are probably more use-cases for pmovzx as a load than for scalar->__m128.