Conversion from packed-double to packed-float is only available in narrowing form (one input vector produces a result half as wide), not in a version that takes 2 vectors of double and packs them into 1 vector of float. So yes, the intrinsics for [v]cvtpd2ps are your only option. These instructions decode to 2 uops on modern Intel: one for the FMA port(s) and one for the shuffle port. (https://agner.org/optimize/)
Storing the result is straightforward: some form of _mm_store/storeu is what you want.
For 128-bit vectors (resulting in 2x float = 64 bits), you don't have a whole 128-bit vector of results. You could shuffle two results together into one 128-bit vector, but with an FP shuffle throughput of 1 per clock on Intel since Sandybridge, it's probably best to just store them both separately.
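For comparison, the shuffle-then-store alternative mentioned above might look like the following minimal sketch. The function name narrow4_pd_to_ps is hypothetical, and the sketch assumes SSE2 and at least 4 readable doubles at src; unaligned loads and stores are used so no alignment is assumed.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtpd_ps, _mm_loadu_pd (also pulls in SSE1) */

/* Hypothetical helper: narrow 4 doubles to 4 floats with one 128-bit store. */
static void narrow4_pd_to_ps(float *dst, const double *src)
{
    /* Each conversion leaves 2 floats in the low 64 bits of its result. */
    __m128 lo = _mm_cvtpd_ps(_mm_loadu_pd(src));
    __m128 hi = _mm_cvtpd_ps(_mm_loadu_pd(src + 2));

    /* movlhps: low half of hi goes into the high half of the result.
       This is the extra shuffle that the separate-store approach avoids. */
    _mm_storeu_ps(dst, _mm_movelh_ps(lo, hi));
}
```

This trades two 64-bit stores for one extra shuffle plus one 128-bit store, so which is better depends on whether the surrounding loop is limited by shuffle or store throughput.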
You want movlps instead of movsd to store the low 64 bits of a float vector: it saves one instruction byte, and the C intrinsic takes less casting to use. Unfortunately it takes a __m64* instead of a float*, so you do still need one cast:
_mm_storel_pi((__m64*)dst, _mm_cvtpd_ps(xmm) );
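Putting that store into a loop, a minimal sketch of an array conversion could look like this. The name narrow_pd_to_ps and the scalar tail handling are assumptions for illustration, not part of the original answer; unaligned loads are used so no alignment is assumed.

```c
#include <emmintrin.h>  /* SSE2: _mm_cvtpd_ps, _mm_loadu_pd (also pulls in SSE1) */
#include <stddef.h>

/* Hypothetical helper: narrow n doubles to floats, two per iteration. */
static void narrow_pd_to_ps(float *dst, const double *src, size_t n)
{
    size_t i = 0;
    for (; i + 2 <= n; i += 2) {
        __m128d d = _mm_loadu_pd(src + i);     /* load 2 doubles */
        __m128  f = _mm_cvtpd_ps(d);           /* 2 floats in the low 64 bits */
        _mm_storel_pi((__m64 *)(dst + i), f);  /* movlps: store the low 64 bits */
    }
    for (; i < n; i++)
        dst[i] = (float)src[i];                /* scalar tail for odd n */
}
```

Both the vector and the scalar path round according to the current MXCSR rounding mode, so the tail gives the same results as the vector body.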
For loading, though, you definitely do want movsd to avoid a false dependency on the old value of the register: movlps loads merge into the destination register, while movsd loads zero-extend. Actually, cvtps2pd xmm, qword [mem] takes care of that for you, if you can get the compiler to emit that from intrinsics.
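One way to express the zero-extending 64-bit load followed by widening is sketched below. Note the swap: instead of _mm_load_sd (movsd), which would need a float* to double* pointer cast, this sketch uses _mm_loadl_epi64 (movq), another 64-bit load that zeroes the upper half. The function name load2_ps_to_pd is hypothetical, and whether a compiler folds the load into a cvtps2pd memory operand is not guaranteed.

```c
#include <emmintrin.h>  /* SSE2: _mm_loadl_epi64, _mm_cvtps_pd */

/* Hypothetical helper: load exactly 2 floats (8 bytes) and widen to 2 doubles. */
static __m128d load2_ps_to_pd(const float *src)
{
    /* movq load: reads 64 bits and zeroes the upper 64, so there is
       no dependency on the previous contents of the register. */
    __m128i raw = _mm_loadl_epi64((const __m128i *)src);

    /* cvtps2pd converts only the low 2 floats of its source. */
    return _mm_cvtps_pd(_mm_castsi128_ps(raw));
}
```

Only 8 bytes are read, so this is safe at the end of a buffer where a full 16-byte load could cross into an unmapped page.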
It might be hard to do that safely, because _mm_cvtps_pd takes a __m128 operand rather than a pointer. The problem is similar to pmovzxbw xmm, qword [mem]: compilers fail to fold a qword load into a memory operand for pmovzx/sx. (See: Loading 8 chars from memory into an __m256 variable as packed single precision floats)