You could use `_mm256_store_pd( (double*)t, a)`. I'm pretty sure this is strict-aliasing safe, because you're not directly dereferencing the pointer after casting it; the `_mm256_store_pd` intrinsic wraps the store with any necessary may-alias stuff.
(With AVX-512, Intel switched to using `void*` for the load/store intrinsics instead of `float*`, `double*`, or `__m512i*`, removing the need for these clunky casts and making it clearer that intrinsics can alias anything.)
The other option is `_mm256_castpd_si256` to reinterpret the bits of your `__m256d` as a `__m256i`:
```c
alignas(32) uint64_t t[4];
_mm256_store_si256( (__m256i*)t, _mm256_castpd_si256(a) );
```
If you read from `t[]` right away, your compiler might optimize away the store/reload and just shuffle or `pextrq rax, xmm0, 1` to extract the FP bit patterns directly into integer registers. You could write this manually with intrinsics. Store/reload is not bad, though, especially if you want more than one of the `double` bit-patterns as scalar integers.
The `_mm256_castpd_si256` cast compiles to zero asm instructions; it's just a type-pun to keep the C compiler happy. You could instead use `union m256_elements { uint64_t u64[4]; __m256d vecd; };`, but there's no guarantee that will compile efficiently.
If you wanted to actually round packed `double` to the nearest signed or unsigned 64-bit integer, with the result in 2's complement or unsigned binary instead of IEEE 754 binary64, you need AVX-512F `_mm256_cvtpd_epi64` / `_mm512_cvtpd_epi64` (`vcvtpd2qq`) for it to be efficient. SSE2 + x86-64 can do it for scalar, or you can use some packed-FP hacks for numbers in the `[0..2^52]` range: How to efficiently perform double/int64 conversions with SSE/AVX?.
BTW, `storeu` doesn't require an aligned destination, but `store` does. If the destination is a local, you should normally align it instead of using an unaligned store, at least if the store happens in a loop, or if this function can inline into a larger function.