
I often need to use double for accuracy reasons, but I want to store the results as floats. What is the optimal way? I'm currently using:

SSE2: _mm_store_sd((double*)dst, _mm_castps_pd(_mm_cvtpd_ps(xmm)));

AVX: _mm_storeu_ps(dst, _mm256_cvtpd_ps(ymm));

AVX512: _mm256_storeu_ps(dst, _mm512_cvtpd_ps(zmm));

Any improvement ideas?

Henkersmann
Vojtěch Melda Meluzín
  • I think your code is ok. You can use `_mm_storel_pd` or `_mm_storel_pi` instead of `_mm_store_sd`. On Intel Skylake that would be as fast as your solution, according to Agner Fog's instruction tables. Maybe there is a small difference on some older architectures, but I didn't check. – wim Oct 18 '18 at 10:24
  • Thanks for the info! – Vojtěch Melda Meluzín Oct 18 '18 at 10:32
  • See also Peter Cordes' comment [here](https://stackoverflow.com/questions/36048502/sse-instruction-movsd-extended-floating-point-scalar-vector-operations-on-x8#comment59749434_36048502). – wim Oct 18 '18 at 12:41
  • The number of instruction bytes might differ between the different instructions, which may have a very minor impact on performance in some cases. See also this [Godbolt link](https://gcc.godbolt.org/z/9GvSX2). In principle the compiler should choose the best instruction; however, Clang prefers `movlps`, while GCC prefers `movlpd`. – wim Oct 18 '18 at 13:38

1 Answer


Conversion from packed-double to packed-float is only available in a narrowing form; there is no version that takes 2 vectors of double and packs them into 1 vector of float. So yes, the intrinsics for [v]cvtpd2ps are your only option. These instructions decode to 2 uops on modern Intel: one for the FMA port(s) and one for the shuffle port. (https://agner.org/optimize/)

Storing the result is straightforward: some form of _mm_store/storeu is what you want.


For 128-bit vectors (resulting in 2x float = 64 bits), you don't have a whole 128-bit vector of results. You could shuffle two together into a 128-bit vector, but with an FP shuffle throughput of 1 per clock on Intel since Sandybridge, it's probably best to just store them both separately.

You want movlps instead of movsd to store the low 64 bits of a float vector: it saves one instruction byte, and the C intrinsic takes less casting to use. But unfortunately it takes a __m64* instead of a float*, so you do still need one cast:

_mm_storel_pi((__m64*)dst,   _mm_cvtpd_ps(xmm) );

But for loading, you definitely do want movsd to avoid a false dependency on the old value. movlps loads merge into a register; movsd loads zero-extend. Actually, cvtps2pd xmm, qword [mem] takes care of that for you, if you can get the compiler to emit that from intrinsics.

It might be hard to get that to happen reliably, for reasons similar to pmovzxbw xmm, qword [mem]: compilers often fail to fold a qword load into a memory operand for pmovzx/sx. (See: Loading 8 chars from memory into an __m256 variable as packed single precision floats.)

Peter Cordes