11

I'm writing a program to detect prime numbers. One part of it sieves out possible candidates by setting bits. I've written a fairly fast version, but I thought I'd see if anyone has better ideas. My program could use some fast gather and scatter instructions, but I'm limited to AVX2 hardware on an x86 architecture (I know AVX-512 has these, though I'm not sure how fast they are).

#include <stdint.h>
#include <immintrin.h>
#include <algorithm>    // for std::swap in the scalar fallback

#define USE_AVX2

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
    uint64_t indx[4], bits[4];

    const __m256i sieveX2 = _mm256_set1_epi64x((uint64_t)(sieveX));
    const __m256i total = _mm256_set1_epi64x(totalX - 1);
    const __m256i mask = _mm256_set1_epi64x(0x3f);

    // Just filling with some typical values (not really constant)
    __m256i ans = _mm256_set_epi64x(58, 52, 154, 1);
    __m256i ans2 = _mm256_set_epi64x(142, 70, 136, 100);

    __m256i sum = _mm256_set_epi64x(201, 213, 219, 237);    // 3x primes
    __m256i sum2 = _mm256_set_epi64x(201, 213, 219, 237);   // These aren't always the same

    // The algorithm can actually change these
    __m256i mod1 = _mm256_set1_epi64x(1);
    __m256i mod3 = _mm256_set1_epi64x(1);

    __m256i mod2, mod4, sum3;

    // Sieve until all factors (start under 32-bit threshold) exceed the limit
    do {
        // Sieve until one of the factors exceeds the limit
        do {
            // Compiler does a nice job converting these into extracts
            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod1, _mm256_and_si256(mask, ans));

            ans = _mm256_add_epi64(ans, sum);

            // Early on these locations can overlap
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod2 = _mm256_sub_epi64(total, ans);

            *(__m256i *)(&indx[0]) = _mm256_add_epi64(_mm256_srli_epi64(_mm256_andnot_si256(mask, ans2), 3), sieveX2);
            *(__m256i *)(&bits[0]) = _mm256_sllv_epi64(mod3, _mm256_and_si256(mask, ans2));

            ans2 = _mm256_add_epi64(ans2, sum2);

            // Two types of candidates are being performed at once
            *(uint64_t *)(indx[0]) |= bits[0];
            *(uint64_t *)(indx[1]) |= bits[1];
            *(uint64_t *)(indx[2]) |= bits[2];
            *(uint64_t *)(indx[3]) |= bits[3];

            mod4 = _mm256_sub_epi64(total, ans2);
        } while (!_mm256_movemask_pd(_mm256_castsi256_pd(_mm256_or_si256(mod2, mod4))));

        // Remove one factor
        mod2 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum), _mm256_castsi256_pd(mod2)));
        mod4 = _mm256_castpd_si256(_mm256_blendv_pd(_mm256_setzero_pd(), _mm256_castsi256_pd(sum2), _mm256_castsi256_pd(mod4)));
        ans = _mm256_sub_epi64(ans, mod2);
        ans2 = _mm256_sub_epi64(ans2, mod4);
        sum = _mm256_sub_epi64(sum, mod2);
        sum2 = _mm256_sub_epi64(sum2, mod4);
        sum3 = _mm256_or_si256(sum, sum2);
     } while (!_mm256_testz_si256(sum3, sum3));
#else
     // Just some example values (not really constant - if they were, the compiler would optimize most of this away)
     uint64_t cur = 58;
     uint64_t cur2 = 142;
     uint64_t factor = 67;

     if (cur < cur2) {
         std::swap(cur, cur2);
     }
    while (cur < totalX) {
        sieveX[cur >> 6] |= (1ULL << (cur & 0x3f));
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur += factor;
        cur2 += factor;
    }
    while (cur2 < totalX) {
        sieveX[cur2 >> 6] |= (1ULL << (cur2 & 0x3f));
        cur2 += factor;
    }
#endif
}

Be warned that the locations can overlap at first. After a short while in the loop, this is no longer the case. I'd be happy to use a different approach if that's possible. Around 82% of the time in this part of the algorithm is spent in this loop. Hopefully this isn't too close to other posted questions.

ChipK
  • Also, even with AVX512 scatter, possible overlap (like `cur[0] == cur[1]`) means you have to check for conflicts so that gather / SIMD OR / scatter has the same semantics as doing them one at a time. If you can rule that out after a certain point, you might go scalar or with `vpconflictq` / retry until then, and then use a loop that doesn't check for conflicts. – Peter Cordes Jul 02 '18 at 01:59
  • Updated the code for gcc (options "-mavx2 -O3") even though I'm using MSVC 17. Remember I "DO NOT" have access to an AVX-512 compiler and machine. So using AVX-512 won't help me. I would love to be able to use _mm256_conflict_epi64 and other advanced gather/scatter commands. – ChipK Jul 02 '18 at 04:38
  • Code review: Use `alignas(32) uint64_t *indx[4]` so you don't have to cast. (And don't declare it with array size 0, unless that's some kind of hack that gets more efficient compiler output from skipping stack alignment because the compiler isn't actually storing/reloading. Was `uint64_t indx[0]` a typo?) I had to add a declaration for `__m256i sum3;` to get it to compile (https://godbolt.org/g/zrSCDa), but yeah the SIMD address calc idea from my answer looks good for Haswell/Skylake. Handy (or not a coincidence :P) that you already had the right mask in a vector. – Peter Cordes Jul 02 '18 at 22:57
  • Yup, typos and I removed a needed variable declaration. I only compiled it - not run it. – ChipK Jul 03 '18 at 01:40
  • It's too bad scalar memory-destination `bts` is not fast. [`bts [rdi], rax`](http://felixcloutier.com/x86/BTS.html) would set that bit in the bit-string, even if that's outside the dword selected by `[rdi]`. (That kind of crazy-CISC behaviour is *why* it's not fast, though! like 10 uops on Skylake.) – Peter Cordes Jul 03 '18 at 02:51

2 Answers

15

IDK why you used different parts of the same cur[8] array for indices and values; it made the source harder to understand before figuring out that there was only one real array. The other part was just used to bounce vectors to scalars.

It looks like you're only ever going vector -> scalar, not inserting scalars back into a vector. And also that nothing inside the loop depends on any data in sieveX[]; I'm not familiar with your sieving algorithm but I guess the point of this is to create data in memory for later use.


AVX2 has gathers (not scatters), but they're only fast on Skylake and newer. They're ok on Broadwell, slowish on Haswell, and slow on AMD. (Like one per 12 clocks for Ryzen's vpgatherqq). See http://agner.org/optimize/ and other performance links in the x86 tag wiki.

Intel's optimization manual has a small section on manual gather / scatter (using insert/extract or movhps) vs. hardware instructions, possibly worth reading. In this case where the indices are runtime variables (not a constant stride or something), I think Skylake can benefit from AVX2 gather instructions here.

See Intel's intrinsics guide to look up the intrinsic for asm instructions like movhps. I'm just talking about what you want to get your compiler to emit, because that's what's important and the asm mnemonics are shorter to type and don't need casting. You have to know the asm mnemonic to look them up in Agner Fog's instruction tables, or to read compiler output from auto-vectorization, so I usually think in asm and then translate that to intrinsics.


With AVX, you have 3 main options:

  • do everything scalar. Register pressure may be a problem, but generating indices as needed (instead of doing all 4 adds or subs to generate curr[4..7] at once) might help. Unless those mask vectors have different values in different elements.

(Using memory sources for scalar constants might not be bad, though, if they don't fit in 32-bit immediates and if you don't bottleneck on 2 memory ops per clock. The memory-destination or instructions would use indexed addressing modes, so the dedicated store-AGU on port 7 on Haswell and later couldn't be used. Thus AGU throughput could be a bottleneck.)

Extracting all 4 elements of a vector to scalar is more expensive than 4x scalar add or shift instructions, but you're doing more work than that. Still, with BMI2 for 1-uop variable-count shifts (instead of 3 on Intel), it might not be terrible. I think we can do better with SIMD, though, especially with careful tuning.

  • extract indices and values to scalar like you're doing now, so the OR into sieveX[] is pure scalar. Works even when two or more indices are the same.

    This costs you about 7 uops per ymm vector -> 4x scalar registers using extract ALU instructions, or 5 uops using store/reload (worth considering for the compiler, maybe for one or two of the 4 vector extracts, because this code probably doesn't manage to bottleneck on load / store port throughput.) If the compiler turns store/reload in the C source into shuffle/extract instructions, though, you can't easily override its strategy except maybe by using volatile. And BTW, you'd want to use alignas(32) cur[8] to make sure actual vector stores don't cross a cache-line boundary.

or [rdi + rax*8], rdx (with an indexed addressing mode preventing full micro-fusion) is 3 uops on modern Intel CPUs (Haswell and later). We could avoid an indexed addressing mode (making it 2 uops for the front-end) by scaling + adding to the array base address using SIMD: e.g. srli by 3 instead of 6, mask off the low 3 bits (vpand), and vpaddq with set1_epi64(sieveX); see the sketch after this list of options. So this costs 2 extra SIMD instructions to save 4 uops on SnB-family, per vector of indices. (You'd be extracting uint64_t* pointer elements instead of uint64_t indices. Or if sieveX can be a 32-bit absolute address1, you could skip the vpaddq and extract already-scaled indices for the same gain.)

It would also enable the store-address uops to run on port 7 (Haswell and later); the simple AGU on port7 can only handle non-indexed addressing modes. (This makes extracting values to scalar with store+reload more attractive. You want lower latency for extracting indices, because the values aren't needed until after the load part of a memory-dst or completes.) It does mean more unfused-domain uops for the scheduler / execution units, but could well be worth the tradeoff.

This isn't a win on other AVX2 CPUs (Excavator / Ryzen or Xeon Phi); only SnB-family has a front-end cost and execution-port restrictions for indexed addressing modes.

  • extract indices, manually gather into a vector with vmovq / vmovhps for a SIMD vpor, then scatter back with vmovq / vmovhps.

    Just like a HW gather/scatter, correctness requires that all indices are unique, so you'll want to use one of the above options until you get to that point in your algo. (vector conflict detection + fallback would not be worth the cost vs. just always extracting to scalar: Fallback implementation for conflict detection in AVX2).

    See selectively xor-ing elements of a list with AVX2 instructions for an intrinsics version. (I knew I'd recently written an answer with a manual gather / scatter, but took me a while to find it!) In that case I only used 128-bit vectors because there wasn't any extra SIMD work to justify the extra vinserti128 / vextracti128.

Actually I think here you'd want to extract the high half of the _mm256_sllv_epi64 result so you have (the data that would be) cur[4..5] and cur[6..7] in two separate __m128i variables. You'd have vextracti128 / 2x vpor xmm instead of vinserti128 / vpor ymm / vextracti128.

The former has less port5 pressure, and has better instruction-level parallelism: The two 128-bit halves are separate dependency chains that don't get coupled to each other, so store/reload bottlenecks (and cache misses) impact fewer dependent uops, allowing out-of-order execution to keep working on more stuff while waiting.

Doing address calculation in a 256b vector and extracting pointers instead of indices would make vmovhps loads cheaper on Intel (indexed loads can't stay micro-fused to vmovhps2). See the previous bullet point. But vmovq loads/stores are always a single uop, and vmovhps indexed stores can stay micro-fused on Haswell and later, so it's break-even for front-end throughput and worse on AMD or KNL. It also means more unfused-domain uops for the scheduler / execution units, which looks like more of a potential bottleneck than port2/3 AGU pressure. The only advantage is that the store-address uops can run on port 7, relieving some pressure.
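Here's a minimal sketch of the SIMD address-calculation idea from the second option (the question's code effectively already does this); the helper name is mine, not from the question:

#include <stdint.h>
#include <immintrin.h>

// Sketch only: turn a vector of bit-indices into a vector of uint64_t* pointer values.
// (idx >> 3) & ~7 is the byte offset of the containing qword; adding the base address
// means the later scalar OR (or vmovq/vmovhps load/store) can use a non-indexed
// addressing mode.
static inline __m256i bitidx_to_ptrs(__m256i bitidx, const uint64_t *sieveX)
{
    __m256i base    = _mm256_set1_epi64x((int64_t)(uintptr_t)sieveX);
    __m256i byteoff = _mm256_andnot_si256(_mm256_set1_epi64x(7),
                                          _mm256_srli_epi64(bitidx, 3));
    return _mm256_add_epi64(byteoff, base);   // extract these instead of indices
}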

AVX2 gives us one new option:

  • AVX2 vpgatherqq for the gather (_mm256_i64gather_epi64(sieveX, srli_result, 8)), then extract indices and manually scatter. So it's exactly like the manual gather / manual scatter, except you replace the manual gather with an AVX2 hardware gather. (Two 128-bit gathers cost more than one 256-bit gather, so you would want to take the instruction-level parallelism hit and gather into a single 256-bit register).

Possibly a win on Skylake (where vpgatherqq ymm is 4 uops / 4c throughput, plus 1 uop of setup), but not even Broadwell (9 uops, one per 6c throughput) and definitely not Haswell (22 uops / 9c throughput). You do need the indices in scalar registers anyway, so you're only saving the manual-gather part of the work. That's pretty cheap.
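A minimal sketch of that gather + manual-scatter option, assuming all four indices in the vector are distinct (otherwise the ORs would lose updates). The helper name and the way the bit vector is built here are mine, not from the question:

#include <stdint.h>
#include <immintrin.h>

// Sketch: set 4 bits given a vector of bit-indices, using an AVX2 hardware gather
// for the loads and vmovq / vmovhps stores for the manual scatter.
// Only correct when the 4 qword indices are all different.
static inline void set4bits_gather(uint64_t *sieve, __m256i bitidx)
{
    __m256i qidx = _mm256_srli_epi64(bitidx, 6);            // qword indices
    __m256i bits = _mm256_sllv_epi64(_mm256_set1_epi64x(1),
                                     _mm256_and_si256(bitidx, _mm256_set1_epi64x(63)));

    // hardware gather of the four 64-bit words (fast on Skylake, slow on Haswell/AMD)
    __m256i old = _mm256_i64gather_epi64((const long long *)sieve, qidx, 8);
    __m256i neu = _mm256_or_si256(old, bits);

    // extract the indices to scalar registers for the scatter
    __m128i ilo = _mm256_castsi256_si128(qidx);
    __m128i ihi = _mm256_extracti128_si256(qidx, 1);
    uint64_t i0 = (uint64_t)_mm_cvtsi128_si64(ilo);
    uint64_t i1 = (uint64_t)_mm_extract_epi64(ilo, 1);
    uint64_t i2 = (uint64_t)_mm_cvtsi128_si64(ihi);
    uint64_t i3 = (uint64_t)_mm_extract_epi64(ihi, 1);

    // scatter the OR result back with vmovq / vmovhps-style stores
    __m128i dlo = _mm256_castsi256_si128(neu);
    __m128i dhi = _mm256_extracti128_si256(neu, 1);
    _mm_storel_epi64((__m128i *)(sieve + i0), dlo);
    _mm_storeh_pd((double *)(sieve + i1), _mm_castsi128_pd(dlo));
    _mm_storel_epi64((__m128i *)(sieve + i2), dhi);
    _mm_storeh_pd((double *)(sieve + i3), _mm_castsi128_pd(dhi));
}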


Total cost for each strategy on Skylake

It looks like this won't bottleneck badly on any one port. GP reg->xmm needs port 5, but xmm->int needs port 0 on SnB-family CPUs, so it's less likely to bottleneck on port 5 when mixed with the shuffles needed for extracting. (e.g. vpextrq rax, xmm0, 1 is a 2 uop instruction, one port 5 shuffle uop to grab the high qword, and a port 0 uop to send that data from SIMD to the integer domain.)

So your SIMD calculation where you need to frequently extract a vector to scalar is less bad than if you needed to frequently insert scalar calculation results into vectors. See also Loading an xmm from GP regs, but that's talking about data that starts in GP regs, not memory.
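To make the extract cost concrete, this is roughly the ALU-extract pattern being counted below (about 7 uops per vector, as mentioned earlier); a compiler might instead choose store/reload. The helper name is mine:

#include <stdint.h>
#include <immintrin.h>

// Sketch: extract one __m256i of qwords to 4 scalars with ALU instructions
// (vextracti128 + vmovq + vpextrq) rather than a store/reload.
static inline void extract4_u64(__m256i v, uint64_t out[4])
{
    __m128i lo = _mm256_castsi256_si128(v);       // no instruction, just the low half
    __m128i hi = _mm256_extracti128_si256(v, 1);  // vextracti128 (port 5)
    out[0] = (uint64_t)_mm_cvtsi128_si64(lo);     // vmovq (port 0)
    out[1] = (uint64_t)_mm_extract_epi64(lo, 1);  // vpextrq (port 0 + port 5)
    out[2] = (uint64_t)_mm_cvtsi128_si64(hi);
    out[3] = (uint64_t)_mm_extract_epi64(hi, 1);
}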

  • extract both / scalar OR: Total = 24 uops = 6 cycles of front-end throughput.

  • vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake)

  • 2x vextracti128 (2 uops for port 5)

  • 4x vmovq (4 p0)

  • 4x vpextrq (8: 4p0 4p5)

  • 4x or [r], r (2 fused-domain uops each = 8 front-end uops. backend: 4p0156 4p23 (load) 4p237 (store-address) 4p4 (store-data)). Non-indexed addressing mode.

Total = 6 uops for p5, just barely fits. Store/reload for a data extract looks sensible, if you could get your compiler to do that. (But compilers don't typically model the pipeline in enough detail to use a mix of strategies in the same loop to balance port pressure.)

  • manual gather/scatter: 20 uops, 5 cycles of front-end throughput (Haswell / BDW / Skylake). Also good on Ryzen.

  • (optional, probably not worth it): vpaddq + vpand address calc (2 uops for port 0/1/5 on Skylake) Skip these if you could use non-VEX movhps for a 1-uop micro-fused indexed load. (But then p237 stores become p23).

  • vextracti128 pointers (1 uop for port 5)

  • 2x vmovq extract (2p0)

  • 2x vpextrq (4 = 2p0 2p5)

  • 2x vmovq load (2p23)

  • 2x vmovhps xmm, xmm, [r] non-indexed load (2 front-end uops micro-fused: 2p23 + 2p5)

  • vextracti128 split the data (p5)

  • 2x vpor xmm (2p015)

  • 2x vmovq store (2x 1 micro-fused uop, 2p237 + 2p4)

  • 2x vmovhps store (2x 1 micro-fused uop, 2p237 + 2p4)

Port bottlenecks: 4 p0 and 4 p5 fits comfortably in 5 cycles, especially when you mix this with your loop which can run several of its uops on port 1. On Haswell paddq is only p15 (not p015), and shifts are only p0 (not p01). AVX2 _mm256_sllv_epi64 is 1 uop (p01) on Skylake, but on Haswell it's 3 uops = 2p0 + p5. So Haswell might be closer to a p0 or p5 bottleneck for this loop, in which case you might want to look at a store/reload extract strategy for one vector of indices.

Skipping the SIMD address calc is probably good, because AGU pressure doesn't look like a problem unless you use a store/reload extract. And it means fewer instructions / smaller code-size and fewer uops in the uop cache. (un-lamination doesn't happen until after the decoders / uop cache, so you still benefit from micro-fusion in the early parts of the front-end, just not at the issue bottleneck.)

  • Skylake AVX2 gather / manual scatter: Total = 18 uops, 4.5 cycles of front-end throughput. (Worse on any earlier uarch or on AMD).

  • vextracti128 indices (1 uop for port 5)

  • 2x vmovq extract (2p0)

  • 2x vpextrq (4 = 2p0 2p5)

  • vpcmpeqd ymm0,ymm0,ymm0 create an all-ones mask for vpgatherqq (p015)

  • vpgatherqq ymm1, [rdi + ymm2*8], ymm0 4 uops for some ports.

  • vpor ymm (p015)

  • vextracti128 on the OR result (p5)

  • 2x vmovq store (2x 1 micro-fused uop, 2p23 + 2p4). Note no port7, we're using indexed stores.

  • 2x vmovhps store (2x 1 micro-fused uop, 2p23 + 2p4).

So even with the best throughput choice, we're still only managing 4 loads / 4 stores per 4.5 cycles, and that's without considering the SIMD work in the loop which costs some front-end throughput. So we're not close to bottlenecking on AGU throughput and having to worry about using port 7.

We could maybe think about store/reload for one of the extracts (if we were the compiler), replacing the 5-instruction, 7-uop vextracti128 / 2x vmovq / 2x vpextrq sequence with a store + 4x reload (5 uops).
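In source, the store/reload alternative for one vector would look something like this sketch (whether the compiler actually keeps it as one store + 4 reloads, or turns it back into shuffles, is up to it):

#include <stdint.h>
#include <stdalign.h>
#include <immintrin.h>

// Sketch: store/reload extract of one vector: one aligned 256-bit store plus four
// scalar reloads (5 uops), vs. the 7-uop / 5-instruction ALU-extract sequence.
static inline void extract4_u64_mem(__m256i v, uint64_t out[4])
{
    alignas(32) uint64_t tmp[4];    // aligned so the vector store can't split a cache line
    _mm256_store_si256((__m256i *)tmp, v);
    out[0] = tmp[0];
    out[1] = tmp[1];
    out[2] = tmp[2];
    out[3] = tmp[3];
}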


Overall: One loop until we're done with conflicts, then a SIMD gather loop

You say that after a certain point, you don't have conflicts (overlap) between the indices like cur[0] == cur[2].

You definitely want a separate loop that doesn't check for conflicts at all to take advantage of this. Even if you had AVX-512, Skylake's vpconflictq is micro-coded and not fast. (KNL has single-uop vpconflictq but it's still faster to avoid it entirely).

I'll leave it up to you (or a separate question) how to figure out for sure when you're done with conflicts and can leave the loop that accounts for that possibility.

You probably want the extract indices + data strategy while there can be conflicts. SIMD conflict checking is possible, but it's not cheap, 11 uops for 32-bit elements: Fallback implementation for conflict detection in AVX2. A qword version is obviously much cheaper than dword (fewer shuffles and compares to get all against all), but you probably still only want to do it every 10 iterations or so of your extract loop.
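As an illustration of why the qword case is cheaper: checking all pairs within one vector of 4 qword indices only takes two vpermq rotations and two compares. This is my own sketch (not the code from the linked answer), and it only checks within one vector, not across your two index vectors:

#include <immintrin.h>

// Sketch: detect duplicate qword elements within one __m256i. Rotating by 1 and
// by 2 element positions covers all 6 pairs of a 4-element vector.
static inline int has_conflict(__m256i idx)
{
    __m256i rot1 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(0, 3, 2, 1));
    __m256i rot2 = _mm256_permute4x64_epi64(idx, _MM_SHUFFLE(1, 0, 3, 2));
    __m256i eq   = _mm256_or_si256(_mm256_cmpeq_epi64(idx, rot1),
                                   _mm256_cmpeq_epi64(idx, rot2));
    return !_mm256_testz_si256(eq, eq);   // nonzero => at least one duplicate index
}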

There's not a huge speedup from the best scalar-OR version to the best gather version (6 cycles vs. 4.5 doesn't account for the other work in the loop, so the ratio is even smaller than that). Leaving the slightly slower version ASAP is not worth making it a lot slower with frequent conflict checks.

So if you can reliably detect when you're done with conflicts, use something like

int conflictcheck = 10;

do {
    if (--conflictcheck == 0) {
        // vector stuff to check for conflicts
        if (/* no conflicts now or in the future */)
            break;

        conflictcheck = 10;  // reset the down-counter
    }

    // main loop body, extract -> scalar OR strategy

} while(blah);


// then fall into the gather/scatter loop.
do {
    // main loop body, gather + manual scatter strategy
} while(blah);

That should compile to a dec / je which only costs 1 uop in the not-taken case.

Doing 9 extra iterations total of the slightly-slower loop is much better than doing thousands of extra expensive conflict checks.


Footnote 1:

If sieveX is static and you're building non-PIC code on Linux (not MacOS) then its address will fit in a disp32 as part of a [reg+disp32] addressing mode. In that case you can leave out the vpaddq. But getting a compiler to treat a uint64_t as an already-scaled array index (with its low bits cleared) would be ugly. Probably have to cast sieveX to uintptr_t and add, then cast back.
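The ugly cast would look something like this sketch, where byte_off stands for an extracted element that already has its low 3 bits clear:

#include <stdint.h>

// Sketch: use an already-scaled byte offset directly, so with a static non-PIC
// sieveX the compiler can fold the base into a [disp32 + reg] addressing mode.
static inline void or_at_scaled(uint64_t *sieveX, uint64_t byte_off, uint64_t bits)
{
    *(uint64_t *)((uintptr_t)sieveX + byte_off) |= bits;
}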

This isn't possible in a PIE executable or shared library (where 32-bit absolute addresses aren't allowed), or on OS X at all (where static addresses are always above 2^32). I'm not sure what Windows allows. Note that [disp32 + reg*8] only has 1 register, but is still an indexed addressing mode so all the SnB-family penalties apply. But if you don't need scaling, reg + disp32 is just base + disp32.

Footnote 2: Fun fact: non-VEX movhps loads can stay micro-fused on Haswell. It won't cause an SSE/AVX stall on Skylake, but you won't get a compiler to emit the non-VEX version in the middle of an AVX2 function.

IACA (Intel's static analysis tool) gets this wrong, though. :( What is IACA and how do I use it?.

This is basically a missed-optimization for -mtune=skylake, but it would stall on Haswell: Why is this SSE code 6 times slower without VZEROUPPER on Skylake?.

The "penalty A" (execute SSE with dirty upper) on Skylake is merely a false dependency on that one register. (And a merging uop for instructions that would otherwise be write-only, but movhps is already a read-modify-write of its destination.) I tested this on Skylake with Linux perf to count uops, with this loop:

    mov     r15d, 100000000

.loop:
    vpaddq  ymm0, ymm1, ymm2      ; dirty the upper part
    vpaddq  ymm3, ymm1, ymm2      ; dirty another register for good measure

    vmovq  xmm0, [rdi+rbx*8]       ; zero the full register, breaking dependencies
    movhps xmm0, [rdi+rbx*8+8]     ; RMW the low 128 bits
                          ; fast on Skylake, will stall on Haswell

    dec r15d
    jnz .loop

The loop runs at ~1.25 cycles per iteration on Skylake (i7-6700k), maxing out the front-end throughput of 4 uops per clock. 5 total fused-domain uops (uops_issued.any), 6 unfused-domain uops (uops_executed.thread). So micro-fusion was definitely happening for movhps without any SSE/AVX problems.

Changing it to vmovhps xmm0, xmm0, [rdi+rbx*8+8] slowed it down to 1.50 cycles per iteration, now 6 fused-domain, but still the same 6 unfused-domain uops.

There's no extra uop if the upper half of ymm0 is dirty when movhps xmm0, [mem] runs. I tested by commenting out the vmovq. But changing vmovq to movq does result in an extra uop: movq becomes a micro-fused load+merge that replaces the low 64 bits (and still zeros the upper 64 bits of xmm0 so it's not quite movlps).


Also note that pinsrq xmm0, [mem], 1 can't micro fuse even without VEX. But with VEX, you should prefer vmovhps for code-size reasons.

Your compiler may want to "optimize" the intrinsic for movhps on integer data into vpinsrq, though; I didn't check.

BeeOnRope
Peter Cordes
  • You're right. The code needed some cleaning up (which I did to some degree). I added the C code as well, for reference. This function is mainly code pulled out of a much larger function. Thanks for spending so much time. Once I have a chance to go through it all I'll probably have more comments. – ChipK Jul 02 '18 at 14:35
  • Updated the code to use fused uop approach. Note: sieveX is a 64-bit address. – ChipK Jul 02 '18 at 17:32
  • @ChipK: yeah, I didn't think it was likely you could take advantage of it being a 32-bit absolute address (i.e. static storage in non-PIC code). That's why I put that in a footnote and wrote the rest of the answer without assuming that. :P – Peter Cordes Jul 02 '18 at 18:48
  • Wow! I guess my next task is to separate out the conflict portion of the loop from the non-conflict portion and then try some of the methods you described. It gives me hope that there is some performance to be gained (with "hard"-ish work). Thanks again! – ChipK Jul 02 '18 at 22:37
  • @ChipK: note that with two separate gather/scatters into two 128-bit vectors, you can arrange it so you can use the manual-gather/scatter loop even if `idx[0]` and `idx[1]` conflict. i.e. gather `idx[0]` and `idx[2]` together, and `1,3` together. Scatter two results before gathering the next two. (Arrange your 256-bit vectors to interleave elements in that order.) If that stride helps, this could let you get into the faster loop sooner. (And reduce the cost of conflict detection to maybe one shuffle / compare / movmsk / test.) ... – Peter Cordes Jul 02 '18 at 23:02
  • ... But if strided / partial conflict avoidance doesn't help, then do all 4 loads before any of the stores in the no-conflict loop so hardware memory-disambiguation has less work to do figuring out if an unknown-address store conflicts with a load. – Peter Cordes Jul 02 '18 at 23:05
  • It's possible for any of idx[0], idx[1], idx[2] or idx[3] to conflict with any of each other. I think it is possible for the conflict to occur during the whole loop (though the number of iterations would be fairly low in those cases). – ChipK Jul 02 '18 at 23:29
  • @ChipK: ok, at first there can be full conflicts, but does the point where `idx[0] != idx[2]` (now and forever) come sooner than the point where `idx[0] != idx[1]`? If yes, you can leave the slow conflict-safe loop sooner, and make the checks cheaper. And BTW, you could maybe set things up with a mix of strategies, using scalar OR for 1 and 2, and manual gather/scatter for `idx[0], idx[3]` so only that conflict mattered. (Of course you'd actually remap your elements so those two were in the same 128 bit lane as each other. Pref. low 128 so conflict checking is cheaper.) – Peter Cordes Jul 02 '18 at 23:39
  • The start values for ans & ans2 are anywhere from [0, 6 * prime) due to alignment and they are being incremented by 3 * prime. Each element of the vector holds a different prime. So I would need a quick way to calculate how many increments occur until there is no conflict. Either that or do so many calculations and then check for conflicts: if there are any, continue the slower way, if not, use gathers (or whatever approach is best). I'll have to think about it some. I'd rather know how long to do the slower approach than waste cycles. – ChipK Jul 03 '18 at 01:38
  • @ChipK: There's prob. not a huge speed diff, so a reasonable upper bound you can compute quickly is good. But checking for conflicts every 10 or 20 iters like the example in my answer should be fine. Avg cost of maybe `1 + 6/20` or so uops inside the "slow" loop, vs. `1` for a counter, out of >40. Plus the differential cost of staying in the somewhat-slower loop for longer. (23 is the threshold where the branch-prediction pattern is too long for Skylake to lock on to it in my testing, for a short loop nested inside another loop. So 20 may be safe-ish. But profile for branch misses.) – Peter Cordes Jul 03 '18 at 02:47
  • The vector has sequential primes. I worry that it is possible to start with no conflicts but at a later time have some. – ChipK Jul 03 '18 at 03:08
  • I started looking at what your code is actually doing. It's not at all clear that splitting the `ans` bit-index into word-offset and bit-within-word with SIMD and then extracting both is even a good idea. You don't do *that* much with SIMD, so pure scalar might be good after all, especially when `mod1=mod3=set1(1)`. You're just striding by a constant number of bits, which you should be able to do more cheaply with scalar. I'm going to leave this answer here for the general question, but in your specific case I think you have the wrong approach entirely. – Peter Cordes Jul 03 '18 at 03:18
  • mod1 and mod3 are vectors of one and/or zero values - AKA { 1, 0, 0, 1 } is valid. The primes being sieved are changed in an outer loop. It's quite possible there is a better approach to this. I was thinking that instead of starting with a pure conflict test, I should check if the values are in sorted order without any conflicts. I think the vector approach works well for checking if the values have exceeded the range. Having 4 independent checks/branches isn't predictable and is rather slow. – ChipK Jul 03 '18 at 03:27
  • @ChipK: any zero elements in `mod1` or `mod3` create *purely* wasted work, because the `bits[]` value for that element will always be zero. The code calling this function should wait until it has 4 factors that need some bits set, or else use pure scalar when there are fewer. – Peter Cordes Jul 03 '18 at 03:30
  • I realize it is wasted work but creating the initial starting values for each is quite expensive. Using vectors has sped up my algorithm by 2x. – ChipK Jul 03 '18 at 03:33
  • @ChipK: I think we can get the same work done (with the same memory access pattern) in fewer uops with scalar. So that's great that your changes gave you a big speedup, but I think there's more performance to be found with a different strategy. For example, given a bit-index in RDX, `shrx rcx, rdx, r10`/ `mov eax, [rdi+rcx*4]` / `bts eax, edx` / `mov [rdi+rcx*4], eax` is 4 uops to set a bit in memory. (BTS masks the shift count for free when used with a register source + dst. With a constant 5 in `r10`, `shrx` is a 1 uop copy+shift to get a dword index) – Peter Cordes Jul 03 '18 at 03:44
  • Is there a fast way to check if elements within a vector are in sorted order or should I post this as another question? Even if I don't use this approach, it would be interesting to see the solution. – ChipK Jul 03 '18 at 03:44
  • If they're all unique, I think you just need one `vpermq` + `vpcmpgtq`, and check that all elements of the result are true (shuffle so that you're doing a>b, b>c, and c>d. If you shuffle the left hand side of the `>`, you can repeat an element so you avoid doing `expected_min > expected_min` or anything else and getting a zero. Or shuffle so the expected result is all-zero.) And BTW, don't be afraid to use scalar while setting up the elements outside this loop. – Peter Cordes Jul 03 '18 at 03:49
  • Is there a way to use the BTS instruction without assembly? I would strongly prefer not to use assembly unless there is a very compelling reason. – ChipK Jul 03 '18 at 04:30
  • @ChipK: use a compiler that knows how to use it as a peephole optimization for `x |= 1ULL << (bit&0x3f);` when it has a variable in a register. I think clang knows how to use it, and maybe gcc sometimes. – Peter Cordes Jul 03 '18 at 04:31
  • @ChipK: I checked, gcc knows how to use it but frequently decides not to. clang is great, MSVC uses it but doesn't optimize away the `and edx,31`. https://godbolt.org/g/WifXhC. So it looks like you get better asm from MSVC by using unmasked shift counts, but that's UB in C++. IDK whether MSVC defines the behaviour as wrapping the shift count, or if it's subject to breakage, too. – Peter Cordes Jul 03 '18 at 05:00
  • Let us [continue this discussion in chat](https://chat.stackoverflow.com/rooms/174281/discussion-between-chipk-and-peter-cordes). – ChipK Jul 03 '18 at 17:29
7

I just looked at exactly what you're doing here: For the mod1 = mod3 = _mm256_set1_epi64x(1); case, you're just setting single bits in a bitmap with elements of ans as the index.

And it's unrolled by two, with ans and ans2 running in parallel, using mod1 << ans and mod3 << ans2. Comment your code and explain what's going on in the big picture using English text! This is just a very complicated implementation of the bit-setting loop of a normal Sieve of Eratosthenes. (So it would have been nice if the question had said so in the first place.)

Unrolling with multiple start/strides in parallel is a very good optimization, so you normally set multiple bits in a cache line while it's still hot in L1d. Cache-blocking for fewer different factors at once has similar advantages. Iterate over the same 8kiB or 16kiB chunk of memory repeatedly for multiple factors (strides) before moving on to the next. Unrolling with 4 offsets for each of 2 different strides could be a good way to create more ILP.

The more strides you run in parallel, the slower you go through new cache lines the first time you touch them, though. (Giving cache / TLB prefetch room to avoid even an initial stall). So cache blocking doesn't remove all the benefit of multiple strides.
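Here's a scalar sketch of that cache-blocking idea; the segment size and the cur[] / stride[] bookkeeping arrays are illustrative assumptions, not anything from the question:

#include <stdint.h>
#include <stddef.h>

// Sketch: set all of every factor's bits inside one L1/L2-sized chunk of the bitmap
// before moving on to the next chunk, so each cache line is touched while still hot.
static void sieve_blocked(uint64_t *sieveX, uint64_t totalX,
                          uint64_t *cur, const uint64_t *stride, size_t nfactors)
{
    const uint64_t SEG_BITS = 16 * 1024 * 8;            // 16 KiB of bitmap per segment
    for (uint64_t seg_end = SEG_BITS; ; seg_end += SEG_BITS) {
        if (seg_end > totalX) seg_end = totalX;
        for (size_t f = 0; f < nfactors; f++) {
            uint64_t c = cur[f];                        // next bit to set for this factor
            while (c < seg_end) {
                sieveX[c >> 6] |= 1ULL << (c & 0x3f);
                c += stride[f];
            }
            cur[f] = c;                                 // resume here in the next segment
        }
        if (seg_end == totalX) break;
    }
}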


Possible special case for strides < 256

A single 256-bit vector load/VPOR/store can set multiple bits. The trick is creating a vector constant, or set of vector constants, with bits in the right position. The repeating pattern is something like LCM(256, bit_stride) bits long, though. For example stride=3 would repeat in a pattern that's 3 vectors long. This very quickly gets unusable for odd / prime strides unless there's something more clever :(
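To make that concrete, here's a sketch for a small odd bit-stride s: the set-bit pattern repeats every LCM(256, s) = 256*s bits, i.e. every s vectors, so you can build s mask vectors once and OR them over the bitmap in rotation. It assumes the first bit to set is bit 0 of the array (a real version would have to handle the starting phase) and s <= 64:

#include <stdint.h>
#include <stddef.h>
#include <immintrin.h>

// Sketch: set every s-th bit of a bitmap (s odd, s <= 64 here) with whole-vector ORs.
static void or_small_stride(uint64_t *sieve, size_t nvec, unsigned s)  // nvec = # of __m256i
{
    __m256i masks[64];
    for (unsigned v = 0; v < s; v++) {                  // build the s repeating masks
        uint64_t m[4] = {0, 0, 0, 0};
        for (unsigned p = 0; p < 256; p++)
            if ((v * 256u + p) % s == 0)
                m[p >> 6] |= 1ULL << (p & 63);
        masks[v] = _mm256_loadu_si256((const __m256i *)m);
    }
    __m256i *vp = (__m256i *)sieve;
    for (size_t i = 0, v = 0; i < nvec; i++) {          // one load / OR / store per vector
        _mm256_storeu_si256(vp + i,
            _mm256_or_si256(_mm256_loadu_si256(vp + i), masks[v]));
        if (++v == s) v = 0;
    }
}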

64-bit scalar is interesting because bitwise rotate is available to create a sequence of patterns, but variable-count rotate on SnB-family CPUs costs 2 uops.

There might be more you can do with this; maybe unaligned loads could help somehow.

A repeating pattern of bitmasks could be useful even for the large-stride case, e.g. rotating by stride % 8 every time. But that would be more useful if you were JITing a loop that hard-coded the pattern into or byte [mem], imm8, with an unroll factor chosen to be congruent with repeat length.


Conflict reduction with narrower loads/stores

You don't have to load/modify/store 64-bit chunks when you're only setting a single bit. The narrower your RMW operations, the closer your bit-indices can be without conflicting.

(But you don't have a long loop-carried dep chain on the same location; you will move on before OoO exec stalls waiting for a reload at the end of a long chain. So if conflicts aren't a correctness problem, it's unlikely to make a big perf difference here. Unlike a bitmap histogram or something where a long chain of repeated hits on nearby bits could easily happen.)

32-bit elements would be an obvious choice. x86 can efficiently load/store dwords to/from SIMD registers as well as scalar. (scalar byte ops are efficient, too, but byte stores from SIMD regs always require multiple uops with pextrb.)

If you're not gathering into SIMD registers, the SIMD element width for ans / ans2 doesn't have to match the RMW width. 32-bit RMW has advantages over 8-bit if you want to split a bit-index into address / bit-offset in scalar, using shifts or bts that implicitly mask the shift count to 32 bits (or 64 bits for 64-bit shifts). But 8-bit shlx or bts doesn't exist.

The main advantage of using 64-bit SIMD elements is if you're calculating a pointer instead of just an index. If you could restrict your sieveX to 32 bits you'd still be able to do this. e.g. allocate with mmap(..., MAP_32BIT|MAP_ANONYMOUS, ...) on Linux. That's assuming you don't need more than 2^32 bits (512MiB) sieve space, so your bit indices always fit in 32-bit elements. If that's not the case, you could still use 32-bit element vectors up to that point, then use your current loop for the high part.
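A sketch of that Linux allocation (MAP_32BIT is an x86-64 Linux / glibc flag that puts the mapping in the low 2 GiB of the address space, so base + scaled offset fits in 32 bits); error handling kept minimal:

#define _GNU_SOURCE
#include <stddef.h>
#include <stdint.h>
#include <sys/mman.h>

// Sketch: allocate the sieve at a low address so computed pointers fit in 32-bit
// SIMD elements.
static uint64_t *alloc_low_sieve(size_t bytes)
{
    void *p = mmap(NULL, bytes, PROT_READ | PROT_WRITE,
                   MAP_PRIVATE | MAP_ANONYMOUS | MAP_32BIT, -1, 0);
    return (p == MAP_FAILED) ? NULL : (uint64_t *)p;
}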

If you use 32-bit SIMD elements without restricting sieveX to be a 32-bit pointer, you'd have to give up on using SIMD pointer calculations and just extract a bit-index, or still split in SIMD into idx/bit and extract both.

(With 32-bit elements, a SIMD -> scalar strategy based on store/reload looks even more attractive, but in C that's mostly up to your compiler.)

If you were manually gathering into 32-bit elements, you couldn't use movhps anymore. You'd have to use pinsrd / pextrd for the high 3 elements, and those never micro-fuse / always need a port5 uop on SnB-family. (Unlike movhps which is a pure store). But that means vpinsrd is still 2 uops with an indexed addressing mode. You could still use vmovhps for element 2 (then overwrite the top dword of the vector with vpinsrd); unaligned loads are cheap and it's ok to overlap the next element. But you can't do movhps stores, and that's where it was really good.


There are two big performance problems with your current strategy:

Apparently you're sometimes using this with some elements of mod1 or mod3 being 0, resulting in completely useless wasted work, doing [mem] |= 0 for those strides.

I think once an element in ans or ans2 reaches total, you're going to fall out of the inner loop and do ans -= sum every time through the inner loop. You don't necessarily want to reset it back (ans = sum for that element) to redo the ORing (setting bits that were already set), because that memory will be cold in cache. What we really want is to pack the remaining still-in-use elements into known locations and enter other versions of the loop that only do 7, then 6, then 5 total elements. Then we're down to only 1 vector.

That seems really clunky. A better strategy for one element hitting the end might be to finish the remaining three in that vector with scalar, one at a time, then run the remaining single __m256i vector. If the strides are all nearby, you probably get good cache locality.


Cheaper scalar, or maybe still SIMD but extract only a bit-index

Splitting the bit-index into a qword index and a bitmask with SIMD and then extracting both separately costs a lot of uops for the scalar-OR case: so many that you're not bottlenecking on 1-per-clock store throughput, even with all the optimizations in my scatter/gather answer. (Cache misses may slow this down sometimes, but fewer front-end uops means a larger out-of-order window to find parallelism and keep more memory ops in flight.)

If we can get the compiler to make good scalar code to split a bit-index, we could consider pure scalar. Or at least extracting only bit-indices and skipping the SIMD shift/mask stuff.

It's too bad scalar memory-destination bts is not fast. bts [rdi], rax would set that bit in the bit-string, even if that's outside the dword selected by [rdi]. (That kind of crazy-CISC behaviour is why it's not fast, though! like 10 uops on Skylake.)

Pure scalar may not be ideal, though. I was playing around with this on Godbolt:

#include <immintrin.h>
#include <stdint.h>
#include <algorithm>

// Sieve the bits in array sieveX for later use
void sieveFactors(uint64_t *sieveX64, unsigned cur1, unsigned cur2, unsigned factor1, unsigned factor2)
{
    const uint64_t totalX = 5000000;
#ifdef USE_AVX2
//...
#else
     //uint64_t cur = 58;
     //uint64_t cur2 = 142;
     //uint64_t factor = 67;
     uint32_t *sieveX = (uint32_t*)sieveX64;

    if (cur1 > cur2) {
        // TODO: if factors can be different, properly check which will end first
        std::swap(cur1, cur2);
        std::swap(factor1, factor2);
    }
    // factor1 = factor2;  // is this always true?

    while (cur2 < totalX) {
         sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
         sieveX[cur2 >> 5] |= (1U << (cur2 & 0x1f));
         cur1 += factor1;
         cur2 += factor2;
    }
    while (cur1 < totalX) {
         sieveX[cur1 >> 5] |= (1U << (cur1 & 0x1f));
         cur1 += factor1;
    }
#endif
}

Note how I replaced your outer if() to choose between loops with sorting cur1, cur2.

GCC and clang put a 1 in a register outside the loop, and use shlx r9d, ecx, esi inside the loop to do 1U << (cur1 & 0x1f) in a single uop without destroying the 1. (MSVC uses load / BTS / store, but it's clunky with a lot of mov instructions. I don't know how to tell MSVC it's allowed to use BMI2.)

If an indexed addressing mode for or [mem], reg didn't cost an extra uop, this would be great.

The problem is that you need a shr reg, 5 in there somewhere, and that's destructive. Putting 5 in a register and using that to copy+shift the bit-index would be an ideal setup for load / BTS / store, but compilers don't seem to know that optimization.

Optimal(?) scalar split and use of a bit-index

   mov   ecx, 5    ; outside the loop

.loop:
    ; ESI is the bit-index.
    ; Could be pure scalar, or could come from an extract of ans directly

    shrx  edx, esi, ecx           ; EDX = ESI>>5 = dword index
    mov   eax, [rdi + rdx*4]
    bts   eax, esi                ; set esi % 32 in EAX
    mov   [rdi + rdx*4], eax


    ; more unrolled iterations

    ; add   esi, r10d               ; ans += factor if we're doing scalar

    ...
    cmp/jb .loop

So given a bit-index in a GP register, that's 4 uops to set the bit in memory. Notice that the load and store are both with mov, so indexed addressing modes have no penalty on Haswell and later.

But the best I could get compilers to make was 5, I think, using shlx / shr / or [mem], reg. (With an indexed addressing mode the or is 3 uops instead of 2.)

I think if you're willing to use hand-written asm, you can go faster with this scalar and ditch SIMD entirely. Conflicts are never a correctness problem for this.

Maybe you can even get a compiler to emit something comparable, but even a single extra uop per unrolled RMW is a big deal.

Peter Cordes
  • Thanks, I'll take a look at your research. My whole system is set up to use 64-bit entries and changing it would require too much work IMO. Each bit array (or portion of one) is limited to fit within L2 cache (not fully implemented yet). As you summarized, it is a segmented prime number sieve. My program has come a long way from the initial implementation (about 20-30X speedup). But I'd love to believe/feel that there is more to be gotten - as you've shown. Right now, my target platform is Windows but I'm hoping to eventually have a Linux version working (shouldn't be a big issue). – ChipK Jul 03 '18 at 14:46
  • I don't mind switching to a scalar approach if it is faster. I mainly used a vector approach to calculate initial starting position (requires several 64-bit modulos) and to prevent segment overruns (as I said before, branch predictor was going crazy with 4 different stopping conditions). I don't think a JIT is in the cards - I used an approach similar to that to sieve small (less than 64) factors. Sorry for the lack of comments and weird variable names - I need to cleanup/rewrite some of the code for clarity. – ChipK Jul 03 '18 at 14:52
  • The only time ORing with zero happens is in segments in which the factor (plus initial starting alignment) exceeds the ending value. So this doesn't happen unless it is using some small segments. Therefore, it's not really wasting time RMWing useless elements except under some exceptionally rare conditions. – ChipK Jul 03 '18 at 15:29
  • I tried a scalar method for the indexes and bit locations but it seems that I'm running into register pressure and the compiler cannot hold all the data in registers (I'm sure hand assembly would probably fix that). – ChipK Jul 03 '18 at 17:28
  • @ChipK: You can use `uint32_t *sieveX = (uint32_t*)sieveX64;` if you compile this file with strict-aliasing disabled, or just as an experiment if it happens to work. It's still the same contiguous bitmap regardless of what element size you use to access a bit and its containing byte / dword / qword. – Peter Cordes Jul 03 '18 at 20:21
  • @ChipK: "*branch predictor was going crazy with 4 different stopping conditions*" Sort your factors so you only have to check 1 in the loop. – Peter Cordes Jul 03 '18 at 20:25
  • My compiler still emits an "AND EDX, 3Fh" so it doesn't seem to have the peephole optimization. Sorting the conditions doesn't help, since the initial alignment can cause any of the stopping conditions to happen before any of the others. – ChipK Jul 03 '18 at 21:46
  • @ChipK: ok, then do `(end - start) / factor` to find out which one will end first. Integer division sucks, but if it saves instructions in a big loop it's probably worth it. Re: `AND EDX, 3Fh`: use clang. Or use `*x |= (1ULL << (bit))` *without* masking in the source. Maybe `#ifdef _MSC_VER` to control that, or just do it as an experiment. Uncomment the small test functions in my Godbolt link to see what MSVC does for them: the C++ undefined behaviour still turns into the BTS we want with MSVC, so x86 shift-count masking does its job. – Peter Cordes Jul 03 '18 at 21:55
  • Does this work only for 32-bit RMW? Yeah, divisions are what I'm trying to reduce/avoid. I'll play around with Godbolt. – ChipK Jul 03 '18 at 22:06
  • @ChipK: No, you just need the shift operand size to match the mask; that's why I used `1ULL` in my comment instead of `1U` in my Godbolt link. Re: divisions: outside the loop its worth doing if it saves work inside. Or use SIMD floating point: packed FP division is cheap compared to integer. – Peter Cordes Jul 03 '18 at 22:09
  • I implemented the scalar version and it is indeed faster - about 10%. – ChipK Jul 04 '18 at 22:32
  • @ChipK: awesome, thanks for reporting on the results. I think much of the speedup you got from implementing SIMD was actually from the change in cache access pattern of running multiple strides in parallel, so you don't touch new memory as fast. (And / or your previous scalar code compiled less efficiently.) With scalar it's harder to efficiently keep as many running because of register limits, and you have to work out the loop condition more carefully, but it's easier to make a version that handles 7 or 6 remaining factors. – Peter Cordes Jul 04 '18 at 22:39