I don't have a particular use-case in mind; I'm asking if this is really a design flaw / limitation in Intel's intrinsics or if I'm just missing something.
Using Intel intrinsics, if you want to combine a scalar float with an existing vector, there doesn't seem to be a way to do it without zeroing the high elements or broadcasting the scalar into a full vector. I haven't investigated GNU C native vector extensions and the associated builtins.
This wouldn't be too bad if the extra intrinsic optimized away, but it doesn't with gcc (5.4 or 6.2). There's also no nice way to use pmovzx or insertps as loads, for the related reason that their intrinsics only take vector args. (And gcc doesn't fold a scalar->vector load into the asm instruction.)
__m128 replace_lower_two_elements(__m128 v, float x) {
__m128 xv = _mm_set_ss(x); // WANTED: something else for this step, some compilers actually compile this to a separate insn
return _mm_shuffle_ps(v, xv, 0); // lower 2 elements are both x, and the garbage is gone
}
gcc 5.3 -march=nehalem -O3 output, to enable SSE4.1 and tune for that Intel CPU: (It's even worse without SSE4.1; multiple instructions to zero the upper elements).
insertps xmm1, xmm1, 0xe # pointless zeroing of upper elements. shufps only reads the low element of xmm1
shufps xmm0, xmm1, 0 # The function *should* just compile to this.
ret
TL;DR: the rest of this question is just asking if you can actually do this efficiently, and if not why not.
clang's shuffle-optimizer gets this right, and doesn't waste instructions on zeroing high elements (_mm_set_ss(x)) or duplicating the scalar into them (_mm_set1_ps(x)). Instead of writing something the compiler has to optimize away, shouldn't there be a way to write it "efficiently" in C in the first place? Even very recent gcc doesn't optimize it away, so this is a real (but minor) problem.
This would be possible if there were a scalar->128b equivalent of __m256 _mm256_castps128_ps256 (__m128 a), i.e. something that produces a __m128 with undefined garbage in the upper elements and the float in the low element, compiling to zero asm instructions if the scalar float/double was already in an xmm register.
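To make the wished-for semantics concrete: with such a cast (here called _mm_castss_ps, a hypothetical name that doesn't exist in any immintrin.h), the motivating function could be written so that only the shufps shows up in the asm:

__m128 replace_lower_two_elements(__m128 v, float x) {
    __m128 xv = _mm_castss_ps(x);    // hypothetical: zero asm instructions, upper elements undefined
    return _mm_shuffle_ps(v, xv, 0); // shufps xmm0, xmm1, 0
}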
None of the following intrinsics exist, but they should:

- A scalar->__m128 equivalent of _mm256_castps128_ps256, as described above: the most general solution for the scalar-already-in-register case.

- __m128 _mm_move_ss_scalar (__m128 a, float s): replace the low element of vector a with scalar s. This isn't actually necessary if there's a general-purpose scalar->__m128 (previous bullet point). (The reg-reg form of movss merges, unlike the load form, which zeros, and unlike movd, which zeros the upper elements in both cases. To copy a register holding a scalar float without false dependencies, use movaps.)

- __m128i _mm_loadzxbd (const uint8_t *four_bytes), and other sizes of PMOVZX / PMOVSX: AFAICT, there's no good safe way to use the PMOVZX intrinsics as a load, because the inconvenient safe way doesn't optimize away with gcc (see the sketch after this list).

- __m128 _mm_insertload_ps (__m128 a, float *s, const int imm8): INSERTPS behaves differently as a load: the upper 2 bits of the imm8 are ignored, and it always takes the scalar at the effective address (instead of an element from a vector in memory). This lets it work with addresses that aren't 16B-aligned, and avoid faulting even if the float is right before an unmapped page. Like with PMOVZX, gcc fails to fold an upper-element-zeroing _mm_load_ss() into a memory operand for INSERTPS. (Note that if the upper 2 bits of the imm8 aren't both zero, then _mm_insert_ps(xmm0, _mm_load_ss(), imm8) can compile to insertps xmm0,xmm0,foo with a different imm8 that zeros elements in the vector as-if the src element were actually a zero produced by MOVSS from memory. clang actually uses XORPS/BLENDPS in that case.)
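For reference, here's a minimal sketch of the "inconvenient safe way" for both cases (the function names are my own, not real intrinsics). These are correct at any optimization level, but with current gcc the scalar load stays a separate movd/movss instruction instead of folding into a memory operand:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// Safe PMOVZX-as-load: a 4-byte scalar load, then reg-reg pmovzxbd.
// memcpy avoids strict-aliasing UB and compiles to a single movd.
static inline __m128i pmovzxbd_load(const uint8_t *four_bytes) {
    int32_t tmp;
    memcpy(&tmp, four_bytes, 4);
    return _mm_cvtepu8_epi32(_mm_cvtsi32_si128(tmp)); // wanted: pmovzxbd xmm, [mem]
}

// Safe INSERTPS-as-load: upper-element-zeroing scalar load, then reg-reg insertps.
static inline __m128 insertps_load_ps(__m128 a, const float *s) {
    return _mm_insert_ps(a, _mm_load_ss(s), 0x10);    // wanted: insertps xmm, [mem], 0x10
}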
Are there any viable workarounds to emulate any of those that are both safe (don't break at -O0 by e.g. loading 16B that might touch the next page and segfault) and efficient (no wasted instructions at -O3 with current gcc and clang at least, preferably also other major compilers)? Preferably also in a readable way, but if necessary it could be put behind an inline wrapper function like __m128 float_to_vec(float a){ return something(a); }.
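One candidate (my own sketch, not a known-good answer): in GNU C, an empty asm statement with a matching "x" constraint can reinterpret a float that's already in an xmm register as a __m128 with don't-care upper elements, at zero instruction cost. This assumes gcc/clang accept the float/__m128 type mismatch across the matching constraint, and it defeats constant-propagation through the wrapper:

#include <immintrin.h>

static inline __m128 float_to_vec(float a) {
#ifdef __GNUC__
    __m128 v;
    asm("" : "=x"(v) : "0"(a));   // same xmm register in and out; no instruction emitted
    return v;
#else
    return _mm_set_ss(a);         // fallback: zeroes the upper elements
#endif
}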
Is there any good reason for Intel not to introduce intrinsics like that? They could have added a float->__m128 with undefined upper elements at the same time as adding _mm256_castps128_ps256. Is this a matter of compiler internals making it hard to implement? Perhaps specifically ICC internals?
The major calling conventions on x86-64 (SysV, or MS __vectorcall) take the first FP arg in xmm0 and return scalar FP values in xmm0, with the upper elements undefined. (See the x86 tag wiki for ABI docs.) This means it's not uncommon for the compiler to have a scalar float/double in a register with unknown upper elements. It will be rare in a vectorized inner loop, so I think avoiding these useless instructions will mostly just save a bit of code size.
The pmovzx case is more serious: that is something you might use in an inner loop (e.g. for a LUT of VPERMD shuffle masks, saving a factor of 4 in cache footprint vs. storing each index padded to 32 bits in memory).
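As a concrete illustration of that use-case (my own sketch; load_vpermd_mask and the LUT layout are assumptions, not from any particular codebase): store each 8-lane vpermd mask as 8 packed bytes instead of 8 dwords, and widen at load time:

#include <immintrin.h>
#include <stdint.h>
#include <string.h>

// lut holds 8-byte entries: 8 packed byte indices per vpermd mask,
// 1/4 the cache footprint of storing 8 dword indices per entry.
static inline __m256i load_vpermd_mask(const uint8_t *lut, unsigned idx) {
    int64_t tmp;
    memcpy(&tmp, lut + 8 * idx, 8);                      // safe 8-byte load (movq)
    return _mm256_cvtepu8_epi32(_mm_cvtsi64_si128(tmp)); // wanted: vpmovzxbd ymm, [mem]
}

// usage: __m256i shuffled = _mm256_permutevar8x32_epi32(v, load_vpermd_mask(lut, i));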
The pmovzx-as-a-load issue has been bothering me for a while now, and the original version of this question got me thinking about the related issue of using a scalar float in an xmm register. There are probably more use-cases for pmovzx as a load than for scalar->__m128.