This problem is not very similar to memcpy. It's all about optimizing the interleaving with shuffles and/or scalar store of the loop-invariant integer member. That makes SIMD hard.
Do you need to have this storage format with the int interleaved with the float members? Interleaving the floats is bad enough. I assume some later code is going to modify the ints in different structs, otherwise it makes no sense to duplicate it for every element.
Could you work in groups of 4 elements, like struct { float a[4], b[4]; int i[4]; }; so you could load+convert 4x contiguous double into 4x float and do a 128-bit SIMD store? You'd still have some spatial locality when accessing all 3 members of a single output-array "struct".
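For example, something like this untested sketch (the struct name, the 123 offset value, and the assumption that len is a multiple of 4 are mine) would need no explicit shuffles at all:

#include <immintrin.h>
#include <stddef.h>

// Hypothetical "groups of 4" output layout, not the format the question requires
struct f2u_x4 { float a[4], b[4]; unsigned int offset[4]; };

void cvt_blocks_avx(f2u_x4 *__restrict dst, const double *__restrict pA,
                    const double *__restrict pB, ptrdiff_t len)
{
    const __m128i voffset = _mm_set1_epi32(123);
    for (ptrdiff_t i = 0; i < len; i += 4) {
        // one 256-bit load + convert and one 128-bit store per member array
        _mm_storeu_ps(dst[i/4].a, _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i])));
        _mm_storeu_ps(dst[i/4].b, _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i])));
        _mm_storeu_si128((__m128i*)dst[i/4].offset, voffset);
    }
}

The only shuffle uops left would be the ones built into vcvtpd2ps itself.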
Anyway, assuming your output format has to be fully interleaved, we don't need to pad it to 16 bytes. x86 CPUs can efficiently handle overlapping 16-byte stores to work with 12-byte structs, like @harold's answer shows. Cache-line splits probably cost less than the extra memory bandwidth needed to store the padding.
Or another strategy would be to use separate stores for the floats vs. the int, so you don't need overlapping. We can probably optimize that to the point where it should bottleneck on 1 store per clock cycle for 1 struct per 2 cycles. (Or slightly lower because IIRC cache-split stores need to replay the store uop, at least on Intel CPUs.) We could also unroll by 4*12 = 3*16 bytes and save 2 integer stores by using SIMD stores that get overlapped by float data. 48 bytes = xyIx|yIxy|IxyI has four I elements as part of four structs, but they're close enough that we can store all 4 with two _mm_storeu_si128( set1(offset) ) intrinsics. Then store the xy pairs overlapping with that. (16-byte boundaries are marked with |.) If cache line splits are a problem, we could do 2x scalar and one SIMD for the last vector which is aligned (if the output array is 16-byte aligned). Or on Intel Haswell and later CPUs, a 32-byte aligned store might be good.
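Here's a minimal sketch of that 48-byte xyIx|yIxy|IxyI grouping (untested scaffolding of my own; it assumes the same 12-byte f2u layout as the full example further down, an offset value of 123, and len being a multiple of 4):

#include <immintrin.h>
#include <stddef.h>

struct f2u { float O1, O2; unsigned int Offset; };   // 12 bytes, same layout as below

void cvt_overlap_sketch(f2u *__restrict dst, const double *__restrict pA,
                        const double *__restrict pB, ptrdiff_t len)
{
    const __m128i voffset = _mm_set1_epi32(123);
    for (ptrdiff_t i = 0; i < len; i += 4) {                   // 4 structs = 48 bytes per iteration
        __m128 a = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));   // A3 A2 A1 A0
        __m128 b = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));   // B3 B2 B1 B0
        __m128 lo = _mm_unpacklo_ps(a, b);                     // B1 A1 B0 A0
        __m128 hi = _mm_unpackhi_ps(a, b);                     // B3 A3 B2 A2

        char *p = (char*)&dst[i];
        // two 16-byte stores cover all four Offset fields (bytes 8..23 and 32..47 of the group)
        _mm_storeu_si128((__m128i*)(p +  8), voffset);
        _mm_storeu_si128((__m128i*)(p + 32), voffset);
        // then 64-bit float-pair stores, overlapping the integer stores where they have to
        _mm_storel_pi((__m64*)(p +  0), lo);   // A0 B0
        _mm_storeh_pi((__m64*)(p + 12), lo);   // A1 B1, overwrites bytes 12..19 of the first int store
        _mm_storel_pi((__m64*)(p + 24), hi);   // A2 B2
        _mm_storeh_pi((__m64*)(p + 36), hi);   // A3 B3, overwrites part of the second int store
    }
}

That's still 6 stores per 4 structs, the same store count as the blend version further down, but the only shuffle work per iteration is the two unpacks instead of two unpacks plus two blends; whether that matters is something I'd want to measure rather than guess.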
If we're not careful, we can very easily bottleneck on shuffle throughput on Intel CPUs, especially Sandybridge-family (SnB through Skylake/Coffee Lake) where FP shuffles can only run on port 5. This is why I'm considering not shuffling everything together for 1 store per struct.
SIMD double->float conversion costs 2 uops: shuffle + FP-math, because float is half the width and the instruction packs the floats into the bottom of the vector register.
AVX is useful here to convert 4 doubles into a SIMD vector of 4 floats.
Other than that, I agree with @harold that 128-bit vectors are probably a good bet. Even AVX2 doesn't have very good 2-input lane-crossing shuffles, and AVX1 is very limited. So we can use 256-bit -> 128-bit double->float conversion to feed an interleave strategy based on __m128.
vmovhps [mem], xmm doesn't cost a shuffle uop on Intel CPUs, just a pure store, so shuffling together 2 vectors and getting [ B1 A1 B0 A0 ] into a single vector sets us up for two 64-bit stores of the low and high halves without any extra shuffling.
OTOH, @harold's version might still be better. 4 shuffles per 2 structs might be better than 4 stores per 2 structs, since the stores will sometimes need to replay for cache line splits, but shuffles don't. But with the overlapping stores trick, 3.5 or 3 stores per 2 structs looks doable.
Or here's another idea that uses some of the above, but does some blending to save stores.
I basically came up with this while editing @harold's code to implement the idea I wrote about in the text above. Using a blend here is a good way to reduce pressure on store and shuffle ports.
Some of those ideas above are still worth exploring, especially doing a wide store of set1(offset) and then overlapping it with 64-bit vmovlps stores. (After unrolling by 3x2 = 6 or 3x4 = 12 output structs, to make it a multiple of the 4 doubles we convert at once.) 12 * 12 = 144 bytes, which is a multiple of 16 but not 32 or 64, so we could at least know where we are relative to a 16-byte boundary at all times, but not to cache lines unless we unroll even more. (Potentially leaving more work that needs cleanup, and bloating code-size.)
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>
struct f2u {
    float O1, O2;
    unsigned int Offset;
};
// names with a leading _ at file scope are reserved for the implementation.
// fixed that portability problem for you.
static const unsigned uiDefaultOffset = 123;
// only requires AVX1
// ideally pA and pB should be 32-byte aligned.
// probably also dst 16-byte aligned is good.
void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m128 voffset = _mm_castsi128_ps(_mm_set1_epi32(uiDefaultOffset));

    // 48 bytes per iteration: 3x16 = 4x12
    ptrdiff_t i;
    for (i = 0; i < len - 3; i += 4)
    {
        // read inputs and convert to float
        __m256d inA = _mm256_loadu_pd(&pA[i]);
        __m256d inB = _mm256_loadu_pd(&pB[i]);
        __m128 inAf = _mm256_cvtpd_ps(inA);       // A3 A2 A1 A0
        __m128 inBf = _mm256_cvtpd_ps(inB);       // B3 B2 B1 B0

        // interleave to get XY pairs
        __m128 lo = _mm_unpacklo_ps(inAf, inBf);  // B1 A1 B0 A0
        __m128 hi = _mm_unpackhi_ps(inAf, inBf);  // B3 A3 B2 A2

        // blend integer into place
        __m128 out0 = _mm_blend_ps(lo, voffset, 1<<2);  // x OF B0 A0
        __m128 out2 = _mm_blend_ps(hi, voffset, 1<<2);  // x OF B2 A2

        // TODO: _mm_alignr_epi8 to create OF OF B1 A1 spending 1 more shuffle to save a store.

        // store results
        _mm_storeu_ps(&dst[i + 0].O1, out0);        // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 1].O1, lo);  // 8 bytes from top half of reg, partial overlap
        dst[i + 1].Offset = uiDefaultOffset;

        _mm_storeu_ps(&dst[i + 2].O1, out2);        // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 3].O1, hi);  // 8 bytes from top half of reg, partial overlap
        dst[i + 3].Offset = uiDefaultOffset;
    }

    // scalar cleanup for when len isn't a multiple of 4
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}
gcc9.1 -O3 -march=skylake on Godbolt compiles the main loop to 19 fused-domain uops for the front-end. (Neither vcvtpd2ps instruction could micro-fuse because GCC didn't do anything clever like addressing pB relative to pA to avoid an indexed addressing mode for one of them. So they're each 3 uops: load + convert + shuffle.)
But it does bottleneck on stores in the back-end anyway, even if it takes a full 5 cycles per iteration to issue from the 4-wide front-end.
With 6 stores (per 4 structs) per iteration, that will bottleneck it to at best 1 iteration per 6 cycles, bottlenecked on the store-data port/execution unit. (Until Ice Lake which can do 2 stores per clock.) So this achieves 1 struct per 1.5 cycles in the theoretical best case, same as I was estimating for the overlapping-store idea before.
(We already know that cache-line split stores are going to need to be replayed, costing throughput, so we know this won't quite manage 1.5 cycles per struct even with no cache misses. But it's probably still better than Harold's bottleneck of 4 cycles per 2 structs = 2 cycles per struct. That speed should actually be achievable though, because it bottlenecks on shuffles which don't need to get replayed on cache-line splits.)
I expect throughput on Ryzen will be similar, bottlenecked on store throughput. We're using 128-bit vectors mostly, and Ryzen has better shuffle throughput than Intel. On SnB-family, there are 4 shuffle uops in the loop.
If I could shuffle differently so I could get two contiguous structs as the high half of the pair of vectors, that would open up the possibility of combining the 2 scalar assignments into one _mm_storeu_si128 that I overlap with two _mm_storeh_pi (movhps) 64-bit stores. (Still doing two blends for the other two output structs.) That would get it down to 5 stores total.
But shufps has restrictions on where it takes source data from, so you can't use it to emulate unpcklps or interleave differently.
Probably it would be worth using palignr for the B1 A1 struct, spending an extra shuffle uop to save a store.
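As an untested sketch that plugs into the loop above (reusing its lo, voffset, dst and i names), the i + 1 struct could become a single 16-byte store:

        // SSSE3 palignr: shift the concatenation voffset:lo right by 8 bytes
        //  -> OF OF B1 A1  (same high-to-low comment convention as above)
        __m128i out1 = _mm_alignr_epi8(_mm_castps_si128(voffset), _mm_castps_si128(lo), 8);
        _mm_storeu_si128((__m128i*)&dst[i + 1].O1, out1);  // covers dst[i+1] completely, replacing the
                                                           // movhps + scalar Offset stores; the 4 bytes it
                                                           // writes into dst[i+2].O1 must be overwritten by
                                                           // the out2 store, so keep that store after this one.

(Doing the same with hi for the i + 3 struct would write 4 bytes past the end of the 48-byte group, into dst[i + 4].O1; that's harmless mid-loop but would write past the end of the array on the final vector iteration unless you keep the scalar store there or guarantee some slack at the end of dst.)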
I haven't benchmarked this or calculated how often the unaligned stores will cross a cache line boundary (and thus cost throughput).
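If you just want to sanity-check the results rather than the speed, a throwaway driver like this (compiled together with the code above; the sizes and values are arbitrary, and N is deliberately not a multiple of 4 so the scalar cleanup loop runs too) will do:

#include <stdio.h>
#include <stddef.h>
#include <vector>

int main()
{
    const ptrdiff_t N = 10;
    std::vector<double> A(N), B(N);
    for (ptrdiff_t i = 0; i < N; i++) {
        A[i] = i + 0.25;
        B[i] = -2.0 * i;
    }

    std::vector<f2u> out(N);
    cvt_interleave_avx(out.data(), A.data(), B.data(), N);

    for (ptrdiff_t i = 0; i < N; i++)
        printf("%td: %f %f %u\n", i, out[i].O1, out[i].O2, out[i].Offset);
}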
AVX512
If we had AVX512, we'd have 2-input lane-crossing shuffles that could let us build up vectors of float+int data more efficiently, with fewer shuffle and store instructions per struct. (We could use vpermt2ps with merge-masking into set1(integer) to interleave 2 vectors of conversion results along with integers in the right places.)