8

I have a section of code which is a bottleneck in a C++ application running on x86 processors, where we take double values from two arrays, cast them to float, and store them in an array of structs. This is a bottleneck because it is called either with very large loops or thousands of times.

Is there a faster way to do this copy & cast operation using SIMD intrinsics? I have seen this answer on faster memcpy, but it doesn't address the cast.

The simple C++ loop case looks like this:

        int _iNum;
        const unsigned int _uiDefaultOffset; // a constant 

        double * pInputValues1; // array of double values, count = _iNum;
        double * pInputValues2; 

        MyStruct * _pPoints;   // array of outputs defined as
        // struct MyStruct 
        // { 
        //    float O1;
        //    float O2;
        //    unsigned int Offset;
        // };

        for (int i = 0; i < _iNum; ++i)
        {
            _pPoints[i].O1 = static_cast<float>(pInputValues1[i]);
            _pPoints[i].O2 = static_cast<float>(pInputValues2[i]);
            _pPoints[i].Offset = _uiDefaultOffset;
        }

Note: The struct format is [Float,Float,Int] (24 bytes) but we could (if it will help performance) add an extra 4 bytes of padding making it 32 bytes.

Dr. Andrew Burnett-Thompson
  • That array of `MyStruct` has an annoying layout to fill it with SIMD, something can be worked out but it will cost extra overhead to shuffle everything into place – harold Jul 12 '19 at 20:27
  • Can you switch the struct to using `double`s instead of `float`s, or switch the `double` arrays to `float` arrays? Also, are you guaranteed that all `pInputValues1[i]` are in the range of a `float`? – NathanOliver Jul 12 '19 at 20:29
  • Unfortunately the input must be double, and output must be float. There will be some precision loss - this code is used in graphics so minor precision loss is acceptable. We can ... pad an extra 4-bytes to make it a 32-byte sized struct if that would help, but it must be Float,Float,Int for the first 3 elements. – Dr. Andrew Burnett-Thompson Jul 12 '19 at 20:31
  • I never knew that i386 had SIMD. – SergeyA Jul 12 '19 at 20:39
  • heh, I mean 'not ARM'. AVX/AVX2 can be assumed. – Dr. Andrew Burnett-Thompson Jul 12 '19 at 20:50
  • @Dr.ABT: `{float,float,int}` is 12 bytes with no padding, not 24. IDK if that mistake in your question is from `double,double,int` with implicit padding for alignment after the 32-bit int, or if you're thinking about a pair of structs and unrolling the loop. – Peter Cordes Jul 13 '19 at 08:19
  • No you're right, 12 bytes. Sorry my mistake. We pad to 16 if it makes it easier – Dr. Andrew Burnett-Thompson Jul 13 '19 at 10:15

3 Answers

7

Here is an attempt with SSE4.1, no AVX (that's trickier to do and so far I'm coming up with even more shuffles), using the 12-byte-per-point format (not tested):

#include <smmintrin.h>  // SSE4.1 intrinsics

// MyStruct, _iNum and _uiDefaultOffset come from the question's surrounding code
void test3(MyStruct * _pPoints, double * pInputValues1, double * pInputValues2) {
        // struct MyStruct 
        // { 
        //    float O1;
        //    float O2;
        //    unsigned int Offset;
        // };
    __m128 offset = _mm_castsi128_ps(_mm_cvtsi32_si128(_uiDefaultOffset));
    int i;
    for (i = 0; i < _iNum - 2; i += 2)
    {
        // read inputs and convert to float
        __m128d inA = _mm_loadu_pd(&pInputValues1[i]);
        __m128d inB = _mm_loadu_pd(&pInputValues2[i]);
        __m128 inAf = _mm_cvtpd_ps(inA);    // 0 0 A1 A0
        __m128 inBf = _mm_cvtpd_ps(inB);    // 0 0 B1 B0
        // shuffle B0 from place 0 to place 1, merge with offset
        __m128 tempA = _mm_shuffle_ps(inBf, offset, _MM_SHUFFLE(1, 0, 0, 0)); // 0 OF B0 B0
        // shuffle A1 from place 1 to place 0, merge with offset
        __m128 tempB = _mm_shuffle_ps(inAf, offset, _MM_SHUFFLE(1, 0, 1, 1)); // 0 OF A1 A1
        // replace B0 at place 0 with A0
        __m128 outA = _mm_blend_ps(tempA, inAf, 1);  // 0 OF B0 A0
        // replace A1 at place 1 with B1
        __m128 outB = _mm_blend_ps(tempB, inBf, 2);  // 0 OF B1 A1
        // store results
        _mm_storeu_ps(&_pPoints[i].O1, outA);
        _mm_storeu_ps(&_pPoints[i + 1].O1, outB);
    }
    // remaining 1 or 2 iterations (the main loop stops early so the overlapping stores stay within the array)
    for (; i < _iNum; i++)
    {
        _pPoints[i].O1 = static_cast<float>(pInputValues1[i]);
        _pPoints[i].O2 = static_cast<float>(pInputValues2[i]);
        _pPoints[i].Offset = _uiDefaultOffset;
    }
}

This uses the ability of shufps to choose from two different sources to do the merging of the dynamic data and the constant offset; the same shuffles also move the float in each group that needs to move. Then blends are used to replace a single float with another float that was already in the right place. This takes 2 shuffles and 2 blends; there is also a way with 3 shuffles and zero blends, but the shuffles all go to p5 on current Intel processors while the blends can go to a different port. The conversions already use p5 too, so it's getting swamped; using the blends should be better. It's still 4 p5 µops per iteration, so it takes at least 2 cycles per item processed, which is not great.
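
For reference, the 3-shuffle / zero-blend variant mentioned above could be a drop-in replacement for the tempA/tempB/outA/outB lines in the loop (a sketch added for illustration, untested; it trades the two blends for one extra p5 shuffle, which is why the blend version above should be better):

// interleave A and B first, then combine each half with the offset vector
__m128 t    = _mm_shuffle_ps(inAf, inBf, _MM_SHUFFLE(1, 0, 1, 0)); // B1 B0 A1 A0
__m128 outA = _mm_shuffle_ps(t, offset, _MM_SHUFFLE(1, 0, 2, 0));  // 0 OF B0 A0
__m128 outB = _mm_shuffle_ps(t, offset, _MM_SHUFFLE(1, 0, 3, 1));  // 0 OF B1 A1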

The main loop skips the last items so that it doesn't write out of bounds: it does slightly-overlapping 16-byte stores that write 4 bytes beyond the end of the struct. That part gets overwritten with the real result by the next store, but it would be dangerous to do that at the very end of the array, which is why the last one or two items are left to the scalar loop.

harold
  • Would it be useful to feed a cvtpd2ps with `movsd` + `movhps` loads? `movhps xmm, [mem]` does cost a micro-fused load + port5 shuffle. Also, scalar copy of the integer part might work fine, so we could use a `movsd` or `movlps` store of the 2 floats. That might balance load+store vs. shuffle throughput for Intel CPUs that bottleneck at 1 shuffle per clock. OTOH, your `vcvtpd2ps xmm, [mem]` is pretty good with AVX. And with overlapping stores, you can keep it down to 12 or 24 bytes per struct. – Peter Cordes Jul 13 '19 at 04:13
  • AVX for 256-bit `vcvtpd2ps xmm0, ymmword [mem]` could be useful here to get `[A3 A2 A1 A0]` and the same for B into `__m128` vectors you can shuffle together in 2 different ways (`unpcklps` and `unpckhps`). That sets you up for `vmovlps` and `vmovhps` stores of the FP parts. `vmovhps [mem], xmm` doesn't cost a shuffle uop on Intel SnB-family, just a pure store. But you do need to copy the integer elements separately, probably with scalar load/store to avoid shuffle-port bottlenecks. Bottleneck = 2x stores per output struct (until Ice Lake) – Peter Cordes Jul 13 '19 at 08:31
  • @PeterCordes can you work it out, I don't really have time for it right now – harold Jul 13 '19 at 08:38
  • Posted an answer that would do 4 structs per 6 clock cycles (or 4 per 5 with another tweak) ignoring the cost of replaying cache-line split store uops. (If I'm remembering correctly that split stores need to be replayed, or can store-buffer entries be misaligned and need 2 commit cycles?) – Peter Cordes Jul 13 '19 at 10:32
  • ^^ this is tested and on my PC clocking in at 700ms vs. the original 1300ms for the same loop when iterating over 1 billion elements. – Dr. Andrew Burnett-Thompson Jul 15 '19 at 10:13
6

This problem is not very similar to memcpy. It's all about optimizing the interleaving with shuffles and/or scalar store of the loop-invariant integer member. That makes SIMD hard.

Do you need to have this storage format with the int interleaved with the float members? Interleaving the floats is bad enough. I assume some later code is going to modify the ints in different structs, otherwise it makes no sense to duplicate it for every element.

Could you work in groups of 4 elements, like struct { float a[4], b[4]; int i[4]; }; so you could load+convert 4x contiguous double into 4x float and do a 128-bit SIMD store? You'd still have some spatial locality when accessing all 3 members of a single output-array "struct".
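
To illustrate, a blocked layout like that needs no interleaving shuffles at all; here's a minimal sketch (added for illustration, assuming AVX and a hypothetical MyStructBlock4 type, not the required Float,Float,Int format):

#include <immintrin.h>
#include <stddef.h>

// hypothetical block-of-4 layout, not the OP's required interleaved format
struct MyStructBlock4 {
    float a[4];
    float b[4];
    unsigned int offset[4];
};

void fill_blocks_avx(MyStructBlock4 *out, const double *pA, const double *pB,
                     ptrdiff_t nblocks, unsigned int defaultOffset)
{
    __m128i voffset = _mm_set1_epi32(defaultOffset);
    for (ptrdiff_t blk = 0; blk < nblocks; ++blk) {
        // convert 4 contiguous doubles to 4 floats per input, then plain 16-byte stores
        __m128 a = _mm256_cvtpd_ps(_mm256_loadu_pd(pA + 4 * blk));
        __m128 b = _mm256_cvtpd_ps(_mm256_loadu_pd(pB + 4 * blk));
        _mm_storeu_ps(out[blk].a, a);
        _mm_storeu_ps(out[blk].b, b);
        _mm_storeu_si128((__m128i*)out[blk].offset, voffset);
    }
}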


Anyway, assuming your output format has to be fully interleaved, we don't need to pad it to 16 bytes. x86 CPUs can efficiently handle overlapping 16-byte stores to work with 12-byte structs, like @harold's answer shows. Cache-line splits probably cost less than the extra memory bandwidth needed to store the padding.

Or another strategy would be to use separate stores for the floats vs. the int, so you don't need overlapping. We can probably optimize that to the point where it should bottleneck on 1 store per clock cycle for 1 struct per 2 cycles. (Or slightly lower because IIRC cache-split stores need to replay the store uop, at least on Intel CPUs.) We could also unroll by 4*12 = 3*16 bytes and save 2 integer stores by using SIMD stores that get overlapped by float data. 48 bytes = xyIx|yIxy|IxyI has four I elements as part of four structs, but they're close enough that we can store all 4 with two _mm_storeu_si128( set1(offset) ) intrinsics. Then store the xy pairs overlapping with that. 16-byte boundaries are marked with |. If cache line splits are a problem, we could do 2x scalar and one SIMD for the last vector which is aligned (if the output array is 16-byte aligned). Or on Intel Haswell and later CPUs, a 32-byte aligned store might be good.
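
To make that 48-byte pattern concrete, here's a rough sketch of the idea (added for illustration, untested): two 16-byte set1(offset) stores cover all four Offset members of a 4-struct group, and four 64-bit float stores then overwrite the offset copies that landed on the floats of structs 1 and 3, giving 6 stores + 2 shuffles per 4 structs. It assumes AVX for the 4-wide conversions and reuses the same f2u / uiDefaultOffset definitions as the full example further down.

#include <immintrin.h>
#include <stddef.h>

struct f2u { float O1, O2; unsigned int Offset; };   // same layout as the full example below
static const unsigned uiDefaultOffset = 123;

// sketch only, untested: 4 structs (48 bytes) per iteration, 2 shuffles + 6 stores, no blends.
// Store order matters: the set1(offset) stores go first, then the 64-bit float stores
// overwrite the offset copies that landed on O1/O2 of structs 1 and 3.
void cvt_interleave_sep_int(f2u *__restrict dst, double *__restrict pA,
                            double *__restrict pB, ptrdiff_t len)
{
    __m128i voffset = _mm_set1_epi32(uiDefaultOffset);
    ptrdiff_t i;
    for (i = 0; i < len - 3; i += 4)
    {
        __m128 inAf = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));   // A3 A2 A1 A0
        __m128 inBf = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));   // B3 B2 B1 B0
        __m128 lo = _mm_unpacklo_ps(inAf, inBf);                  // B1 A1 B0 A0
        __m128 hi = _mm_unpackhi_ps(inAf, inBf);                  // B3 A3 B2 A2

        // two 16-byte integer stores cover Offset of all 4 structs
        // (each also scribbles offset copies onto O1/O2 of the following struct)
        _mm_storeu_si128((__m128i*)&dst[i + 0].Offset, voffset);
        _mm_storeu_si128((__m128i*)&dst[i + 2].Offset, voffset);

        // 64-bit float stores, partially overlapping the integer stores, fix that up
        _mm_storel_pi((__m64*)&dst[i + 0].O1, lo);   // A0 B0
        _mm_storeh_pi((__m64*)&dst[i + 1].O1, lo);   // A1 B1
        _mm_storel_pi((__m64*)&dst[i + 2].O1, hi);   // A2 B2
        _mm_storeh_pi((__m64*)&dst[i + 3].O1, hi);   // A3 B3
    }
    // scalar cleanup for the last 0..3 elements
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}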


If we're not careful, we can very easily bottleneck on shuffle throughput on Intel CPUs, especially Sandybridge-family (SnB through Skylake/Coffee Lake) where FP shuffles can only run on port 5. This is why I'm considering not shuffling everything together for 1 store per struct.

SIMD double->float conversion costs 2 uops: shuffle + FP-math, because float is half the width and the instruction packs the floats into the bottom of the vector register.

AVX is useful here to convert 4 doubles into a SIMD vector of 4 floats.

Other than that, I agree with @harold that 128-bit vectors are probably a good bet. Even AVX2 doesn't have very good 2-input lane-crossing shuffles, and AVX1 is very limited. So we can use 256-bit -> 128-bit double->float conversion to feed an interleave strategy based on __m128.

vmovhps [mem], xmm doesn't cost a shuffle uop on Intel CPUs, just a pure store, so shuffling together 2 vectors and getting [ B1 A1 B0 A0 ] into a single vector sets us up for two 64-bit stores of the low and high halves without any extra shuffling.

OTOH, @harold's version might still be better. 4 shuffles per 2 structs might be better than 4 stores per 2 structs, since the stores will sometimes need to replay for cache line splits, but shuffles don't. But with the overlapping stores trick, 3.5 or 3 stores per 2 structs looks doable.


Or here's another idea that uses some of the above, but does some blending to save stores.

I basically came up with this while editing @harold's code to implement the idea I wrote about in the text above. Using a blend here is a good way to reduce pressure on store and shuffle ports.

Some of those ideas above are still worth exploring, especially doing a wide store of set1(offset) and then overlapping it with 64-bit vmovlps stores. (After unrolling by 3x2 = 6 or 3x4 = 12 output structs, to make it a multiple of the 4 doubles we convert at once.) 12 * 12 = 144 bytes, which is a multiple of 16 but not 32 or 64, so we could at least know where we are relative to a 16-byte boundary at all times, but not to cache lines unless we unroll even more. (Potentially leaving more work that needs cleanup, and bloating code-size.)

#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

struct f2u { 
  float O1, O2;
  unsigned int Offset;
};

// names with a leading _ at file scope are reserved for the implementation.
// fixed that portability problem for you.
static const unsigned uiDefaultOffset = 123;


// only requires AVX1
// ideally pA and pB should be 32-byte aligned.
// probably also dst 16-byte aligned is good.
void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m128 voffset = _mm_castsi128_ps(_mm_set1_epi32(uiDefaultOffset));

    // 48 bytes per iteration: 3x16 = 4x12
    ptrdiff_t i;
    for (i = 0; i < len - 3; i += 4)
    {
        // read inputs and convert to float
        __m256d inA = _mm256_loadu_pd(&pA[i]);
        __m256d inB = _mm256_loadu_pd(&pB[i]);
        __m128 inAf = _mm256_cvtpd_ps(inA);    // A3 A2 A1 A0
        __m128 inBf = _mm256_cvtpd_ps(inB);    // B3 B2 B1 B0

        // interleave to get XY pairs
        __m128 lo = _mm_unpacklo_ps(inAf, inBf); // B1 A1 B0 A0
        __m128 hi = _mm_unpackhi_ps(inAf, inBf); // B3 A3 B2 A2

        // blend integer into place
        __m128 out0 = _mm_blend_ps(lo, voffset, 1<<2);  // x OF B0 A0
        __m128 out2 = _mm_blend_ps(hi, voffset, 1<<2);  // x OF B2 A2

        // TODO: _mm_alignr_epi8 to create OF OF B1 A1 spending 1 more shuffle to save a store.

        // store results
        _mm_storeu_ps(&dst[i + 0].O1, out0);  // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 1].O1, lo);    // 8 bytes from top half of reg, partial overlap
        dst[i + 1].Offset = uiDefaultOffset;

        _mm_storeu_ps(&dst[i + 2].O1, out2);  // 16 bytes with blended integer
        _mm_storeh_pi((__m64*)&dst[i + 3].O1, hi);    // 8 bytes from top half of reg, partial overlap
        dst[i + 3].Offset = uiDefaultOffset;
    }

    // scalar cleanup for the last 0..3 elements (len not a multiple of 4)
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

gcc9.1 -O3 -march=skylake on Godbolt compiles the main loop to 19 fused-domain uops for the front-end. (Neither vcvtpd2ps instruction could micro-fuse, because GCC didn't do anything clever like addressing pB relative to pA to avoid an indexed addressing mode for one of them. So they're each 3 uops: load + convert + shuffle.)
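
As an aside, one well-defined source-level way to nudge compilers toward simple one-register addressing modes is to walk incrementing pointers instead of indexing both arrays with i. A minimal sketch of just the load+convert part (added for illustration; the function name is made up, and there's no guarantee a given compiler keeps this form):

#include <immintrin.h>
#include <stddef.h>

// Illustration only: incrementing pointers instead of pA[i]/pB[i] indexing, so the
// vcvtpd2ps memory operands can use simple [reg] addressing and stay micro-fused on
// SnB-family. The interleave/store body would be the same as in cvt_interleave_avx.
void cvt_two_streams(float *outA, float *outB,
                     const double *pA, const double *pB, ptrdiff_t len)
{
    const double *a = pA, *b = pB;
    float *oa = outA, *ob = outB;
    for (ptrdiff_t n = len & ~ptrdiff_t(3); n > 0; n -= 4, a += 4, b += 4, oa += 4, ob += 4) {
        _mm_storeu_ps(oa, _mm256_cvtpd_ps(_mm256_loadu_pd(a)));   // vcvtpd2ps xmm, [a]
        _mm_storeu_ps(ob, _mm256_cvtpd_ps(_mm256_loadu_pd(b)));   // vcvtpd2ps xmm, [b]
    }
    for (ptrdiff_t i = len & ~ptrdiff_t(3); i < len; ++i) {       // scalar tail
        outA[i] = static_cast<float>(pA[i]);
        outB[i] = static_cast<float>(pB[i]);
    }
}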

But it does bottleneck on stores in the back-end anyway, even though it takes a full 5 cycles per iteration to issue from the 4-wide front-end.

With 6 stores (per 4 structs) per iteration, that will bottleneck it to at best 1 iteration per 6 cycles, bottlenecked on the store-data port/execution unit. (Until Ice Lake which can do 2 stores per clock.) So this achieves 1 struct per 1.5 cycles in the theoretical best case, same as I was estimating for the overlapping-store idea before.

(We already know that cache-line split stores are going to need to be replayed, costing throughput, so we know this won't quite manage 1.5 cycles per struct even with no cache misses. But it's probably still better than Harold's bottleneck of 4 cycles per 2 structs = 2 cycles per struct. That speed should actually be achievable though, because it bottlenecks on shuffles which don't need to get replayed on cache-line splits.)

I expect throughput on Ryzen will be similar, bottlenecked on store throughput. We're using 128-bit vectors mostly, and Ryzen has better shuffle throughput than Intel. On SnB-family, there are 4 shuffle uops in the loop.

If I could shuffle differently so I could get two contiguous structs as the high half of the pair of vectors, that would open up the possibility of combining the 2 scalar assignments into one _mm_storeu_si128 that I overlap with two _mm_storeh_pi (movhps) 64-bit stores. (Still doing two blends for the other two output structs.) That would get it down to 5 stores total.

But shufps has restrictions on where it takes source data from, so you can't use it to emulate unpcklps or interleave differently.

Probably it would be worth using palignr for the B1 A1 struct, spending an extra shuffle uop to save a store.
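
For example (a sketch added for illustration, untested), _mm_alignr_epi8 can build the OF OF B1 A1 vector from lo and the offset vector, so struct i+1 gets a single 16-byte store instead of a vmovhps plus a scalar Offset store. The 4 bytes it spills into dst[i+2].O1 get overwritten by the out2 store, which therefore has to stay after it:

// inside the main loop of cvt_interleave_avx, after computing lo, hi, out0 and out2:
// bytes 8..15 of lo (A1, B1) followed by bytes 0..7 of voffset (OF, OF)
__m128 out1 = _mm_castsi128_ps(
        _mm_alignr_epi8(_mm_castps_si128(voffset), _mm_castps_si128(lo), 8)); // OF OF B1 A1

_mm_storeu_ps(&dst[i + 0].O1, out0);   // A0 B0 OF, plus 4 bytes into dst[i+1].O1
_mm_storeu_ps(&dst[i + 1].O1, out1);   // A1 B1 OF, plus 4 bytes into dst[i+2].O1
_mm_storeu_ps(&dst[i + 2].O1, out2);   // overwrites that spill, so it must come after
_mm_storeh_pi((__m64*)&dst[i + 3].O1, hi);
dst[i + 3].Offset = uiDefaultOffset;   // now 5 stores per 4 structs, plus one extra shuffle uop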

I haven't benchmarked this or calculated how often the unaligned stores will cross a cache line boundary (and thus cost throughput).


AVX512

If we had AVX512, we'd have 2-input lane-crossing shuffles that could let us build up vectors of float+int data more efficiently, with fewer shuffle and store instructions per struct. (We could use vpermt2ps with merge-masking into set1(integer) to interleave 2 vectors of conversion results along with integers in the right places.)

Peter Cordes
5

Loosely inspired by Intel's 4x3 transposition example and based on @PeterCordes' solution, here is an AVX1 solution, which should get a throughput of 8 structs within 8 cycles (the bottleneck is still p5):

#include <immintrin.h>
#include <stddef.h>

struct f2u { 
  float O1, O2;
  unsigned int Offset;
};
static const unsigned uiDefaultOffset = 123;

void cvt_interleave_avx(f2u *__restrict dst, double *__restrict pA, double *__restrict pB, ptrdiff_t len)
{
    __m256 voffset = _mm256_castsi256_ps(_mm256_set1_epi32(uiDefaultOffset));

    // 8 structs per iteration
    ptrdiff_t i=0;
    for(; i<len-7; i+=8)
    {
        // destination address for next 8 structs as float*:
        float* dst_f = reinterpret_cast<float*>(dst + i);

        // 4*vcvtpd2ps    --->  4*(p1,p5,p23)
        __m128 inA3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i]));
        __m128 inB3210 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i]));
        __m128 inA7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pA[i+4]));
        __m128 inB7654 = _mm256_cvtpd_ps(_mm256_loadu_pd(&pB[i+4]));

        // 2*vinsertf128  --->  2*p5
        __m256 A76543210 = _mm256_set_m128(inA7654,inA3210);
        __m256 B76543210 = _mm256_set_m128(inB7654,inB3210);

        // 2*vpermilps    --->  2*p5
        __m256 A56741230 = _mm256_shuffle_ps(A76543210,A76543210,_MM_SHUFFLE(1,2,3,0));
        __m256 B67452301 = _mm256_shuffle_ps(B76543210,B76543210,_MM_SHUFFLE(2,3,0,1));

        // 6*vblendps     ---> 6*p015 (does not need to use p5)
        __m256 outA1__B0A0 = _mm256_blend_ps(A56741230,B67452301,2+16*2);
        __m256 outA1ccB0A0 = _mm256_blend_ps(outA1__B0A0,voffset,4+16*4);

        __m256 outB2A2__B1 = _mm256_blend_ps(B67452301,A56741230,4+16*4);
        __m256 outB2A2ccB1 = _mm256_blend_ps(outB2A2__B1,voffset,2+16*2);

        __m256 outccB3__cc = _mm256_blend_ps(voffset,B67452301,4+16*4);
        __m256 outccB3A3cc = _mm256_blend_ps(outccB3__cc,A56741230,2+16*2);

        // 3* vmovups     ---> 3*(p237,p4)
        _mm_storeu_ps(dst_f+ 0,_mm256_castps256_ps128(outA1ccB0A0));
        _mm_storeu_ps(dst_f+ 4,_mm256_castps256_ps128(outB2A2ccB1));
        _mm_storeu_ps(dst_f+ 8,_mm256_castps256_ps128(outccB3A3cc));
        // 3*vextractf128 ---> 3*(p23,p4)
        _mm_storeu_ps(dst_f+12,_mm256_extractf128_ps(outA1ccB0A0,1));
        _mm_storeu_ps(dst_f+16,_mm256_extractf128_ps(outB2A2ccB1,1));
        _mm_storeu_ps(dst_f+20,_mm256_extractf128_ps(outccB3A3cc,1));
    }

    // scalar cleanup for the last 0..7 elements (len not a multiple of 8)
    for (; i < len; i++)
    {
        dst[i].O1 = static_cast<float>(pA[i]);
        dst[i].O2 = static_cast<float>(pB[i]);
        dst[i].Offset = uiDefaultOffset;
    }
}

Godbolt link, with minimal test-code at the end: https://godbolt.org/z/0kTO2b

For some reason, gcc does not like to generate vcvtpd2ps instructions that convert directly from memory to a register. This might work better with aligned loads (having input and output aligned is likely beneficial anyway). And clang apparently wants to outsmart me with one of the vextractf128 instructions at the end.

chtz
  • re: lack of `vcvtpd2ps xmm, ymmword [mem]`: you compiled with `-mtune=generic` which still unfortunately includes `-mavx256-split-unaligned-load` (and store). [Why doesn't gcc resolve _mm256_loadu_pd as single vmovupd?](//stackoverflow.com/q/52626726). That's why I always use `-march=skylake` or similar. Or at least `-mtune=skylake`. It's really unfortunate that there's no `tune=generic-avx2` or `tune=arch` to tune for the subset of x86-64 CPUs that support your `-m` options, like `-mavx` and `-mavx2`. But yes, aligned loads would have fixed your problem. – Peter Cordes Jul 13 '19 at 14:19
  • @PeterCordes Thanks for the link. Good to know – chtz Jul 13 '19 at 14:23
  • Nice use of `vextractf128`, I'd forgotten that was also a pure store (not p5), or hadn't even thought of a way to build outputs from inputs with more blends and fewer shuffles. BTW, your `2*vpermilps` is just how clang chooses to compile it, wasting 1 byte of code size and not matching your intrinsic. GCC compiles as written to `2x vshufps same,same`, allowing a 2-byte VEX prefix instead of 3-byte for the extra mandatory-prefix bits for `vpermilps`. Both compilers unfortunately use indexed addressing modes for the `vcvtpd2ps` memory operands, so they will unlaminate on SnB-family :/ – Peter Cordes Jul 13 '19 at 14:25
  • It should actually be possible to replace one of the register-`vinsertf128` by storing the `vcvtpd2ps` results to memory and using a memory-`vinsertf128`. That could save one p5-uOp, at the cost of another p4-uOp. Probably not easy to convince the compiler to do that without inline-assembly, though (and not sure if that actually helps). – chtz Jul 13 '19 at 14:43
  • `vinsertf128` from memory does still need a uop to merge into the existing YMM, but it can run on any port (like a blend, not a shuffle). So yes this should help on current Intel unless it introduces a bottleneck on front-end uop throughput. It won't help on AMD: AMD vinsertf128 reg,reg is already any port. If the output array is aligned, we shouldn't have any cache-line splits because you're building full vectors for aligned stores, not overlapping. So with 7 store-data uops we wouldn't expect any replays for split stores. (And I'm not sure the store-data uops need replays, maybe just addr) – Peter Cordes Jul 13 '19 at 14:56
  • So getting down to a bottleneck of 7 shuffles + 7 stores for 8 structs is very good, and there shouldn't be hidden overhead from split-store uop replays. It might be possible to use a `volatile __m128` to hand-hold a compiler into it, but that tends to prevent future smart compilers from doing something better. And it's maybe worse for AMD and pretty definitely worse for Ice Lake (2 store ports, 2 shuffle ports). – Peter Cordes Jul 13 '19 at 15:00
  • ^^ This is tested on my PC and clocking in at an astonishing 550ms vs. 1300ms for the same loop when processing 1 billion elements. – Dr. Andrew Burnett-Thompson Jul 15 '19 at 10:14
  • @Dr.ABT what cpu did you test on -- did you compile with `-march=native`? For Intel it may be worth trying to replace one of the register-`vinsertf128` by a memory-version of that instruction. Can you make your memory aligned? And I guess you only benchmarked single-core performance? – chtz Jul 17 '19 at 05:13