
AVX512CD contains the intrinsic `_mm512_conflict_epi32(__m512i a)`. It returns a vector where, for each element of `a`, a bit is set for every preceding element that holds the same value. Is there a way to do something similar in AVX2?

I'm not interested in the exact bits; I just need to know which elements are duplicates of the elements to their left (or right). I simply need to know whether a scatter would conflict.

Basically I need an AVX2 equivalent for

__m256i detect_conflict(__m256i a) {
  __m256i cd = _mm256_conflict_epi32(a);   // AVX512VL + AVX512CD
  return _mm256_cmpgt_epi32(cd, _mm256_set1_epi32(0));
}

The only way I could think of is to use `_mm256_permutevar8x32_epi32()` to shift each value right by 1 (across the lanes), then do seven compares, mask out the unused bits and then `_mm256_or_si256()` the results together, which is horribly slow.
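For reference, that brute-force version would look roughly like this (an untested sketch I'm including only to make the idea concrete; each rotation needs its own index vector, and the mask keeps only comparisons against elements that really are to the left):

#include <immintrin.h>

// untested sketch of the slow approach described above:
// 7 lane-crossing shuffles + 7 compares + masking + ORs
__m256i detect_conflict_avx2(__m256i a)
{
    __m256i conflicts = _mm256_setzero_si256();
    for (int k = 1; k < 8; ++k) {
        // bring element (i - k) into position i
        __m256i idx = _mm256_setr_epi32((0+8-k)&7, (1+8-k)&7, (2+8-k)&7, (3+8-k)&7,
                                        (4+8-k)&7, (5+8-k)&7, (6+8-k)&7, (7+8-k)&7);
        __m256i rot = _mm256_permutevar8x32_epi32(a, idx);
        __m256i eq  = _mm256_cmpeq_epi32(a, rot);
        // only keep lanes where an element k positions to the left exists (i >= k)
        __m256i keep = _mm256_setr_epi32(0 >= k ? -1 : 0, 1 >= k ? -1 : 0,
                                         2 >= k ? -1 : 0, 3 >= k ? -1 : 0,
                                         4 >= k ? -1 : 0, 5 >= k ? -1 : 0,
                                         6 >= k ? -1 : 0, 7 >= k ? -1 : 0);
        conflicts = _mm256_or_si256(conflicts, _mm256_and_si256(eq, keep));
    }
    return conflicts;   // all-ones in each element that duplicates something to its left
}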

Cody Gray - on strike
Christoph Diegelmann
  • In the `_epi64` case @harold [shows](https://stackoverflow.com/a/44575949) that with AVX2 only 2 comparisons are needed instead of 3. I think you can use the same idea here and save a few redundant comparisons. – wim Jun 30 '17 at 10:07
  • @wim Thanks for sharing! It might be worth trying whether simply detecting that there is a conflict and then switching to scalar code is enough. The problem with packing more comparisons into a single compare is that I need to permute them back to the correct position and still have to do a lot of `or`s. – Christoph Diegelmann Jun 30 '17 at 10:35
  • Without thinking about this too hard, it seems reasonable that the emulation *would* be slow. Intel would not have introduced a brand-new instruction for this operation if it were easy and efficient to do with previous instruction sets. – Cody Gray - on strike Jun 30 '17 at 10:45
  • @CodyGray Of course there won't be a really fast answer to this. But there are e.g. reasonable alternatives to `_mm256_mask_compress_epi32()`. I'd be happy if I could at least get rid of some of the permutation masks to free some registers. – Christoph Diegelmann Jun 30 '17 at 10:51
  • I'm quite curious about the latency and throughput numbers of `_mm256_conflict_epi32` Will it be as efficient as it looks like? – wim Jun 30 '17 at 11:05
  • It's efficient on KNL (L: 3, T: 1); that doesn't really predict the future, but at least it shows that it's possible (and to some extent "worth it") to make it fast. – harold Jun 30 '17 at 11:14
  • That is quite fast indeed. – wim Jun 30 '17 at 11:23
  • I gave actually emulating `vpconflictd` a try, not tested, looks horrible (can be improved I'm sure): https://godbolt.org/g/oqtD5i – harold Jun 30 '17 at 13:01
  • @harold Fun Fact: [The conflict detection instructions are not fast on Skylake. (10 - 20 cycles)](https://github.com/InstLatx64/InstLatx64/blob/master/GenuineIntel0050654_SkylakeX_InstLatX64.txt) By comparison Knights Landing has it in 3 cycles. So Skylake X doesn't seem to have native hardware for it. – Mysticial Jun 30 '17 at 16:43
  • @Mysticial that's not really my idea of fun, but good to know – harold Jun 30 '17 at 16:56
  • @Mysticial Thanks for sharing! Hopefully they improve it for coffee lake. – Christoph Diegelmann Jun 30 '17 at 17:55
  • @Christoph Coffee Lake doesn't have AVX512 at all. So we're looking at Cannonlake at minimum. – Mysticial Jun 30 '17 at 18:01
  • `8 choose 2` is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons, but shuffles cost instructions, too. If we can find something that does 4 vector compares with only one shuffle between each compare, that'll be close to optimal. Different options might have more ILP for the compares or shuffles, or need more registers. Since `pcmpeqd` results together also costs an instruction, more shuffles might be worth it for fewer compares, except that SKL only has one shuffle unit but multiple compare and boolean ALUs. – Peter Cordes Jul 01 '17 at 01:02
  • @harold: You can detect the presence/absence of conflicts for the whole vector (without finding their locations) significantly more efficiently. It's still not great, but see my answer for a 3-shuffle, 4-compare, 3-OR version. (9 cycle latency on Haswell to produce a zero / non-zero integer you can test/branch on. 1 cycle less if gcc optimized dep chains for Haswell instruction latencies). – Peter Cordes Jul 01 '17 at 13:05
  • @harold: I know you know :P Mainly I was just happy to have finished writing this answer and wanted to ping people that might be interested :P And just to comment on exactly what the perf gain is from doing less work. – Peter Cordes Jul 01 '17 at 13:14
  • It's a bit late and I will make it a whole answer once I find the time: if you have the same problem, make sure to check for known relations within the elements. It turned out that my values were completely sorted, so I only need to check against the previous element! – Christoph Diegelmann Jul 18 '17 at 05:45

1 Answer


TL:DR: Since full detection of which elements conflict is expensive, it's probably worth doing more fall-back work in exchange for cheaper detection. This depends on your conflict-handling options / strategies.

I came up with a fairly efficient way to check for presence/absence of conflicts without finding their locations, like this answer for 64-bit integer elements. It's actually faster than Skylake-AVX512's micro-coded vpconflictd ymm, but of course it gives you much less information. (KNL has fast vpconflictd.)

You could use a fully-scalar fallback for all the elements if there are any conflicts. This would work well if conflicts are rare enough that branch-mispredicts don't kill performance. (AVX2 doesn't have scatter instructions in the first place, though, so I'm not sure exactly what you need this for.)

The only-left or only-right behaviour is hard, but my method can give you a mask of which elements have conflicts with any other element (e.g. v[0] == v[3] would result in both conflict[0] and conflict[3] being true). This costs only 1 extra shuffle, or maybe 0 with a redesign with this goal in mind.

(I misread the question at first; I thought you wanted to check both directions, rather than talking about two different implementation options for most of what vpconflictd does. Actually at first I thought you just wanted a presence/absence check, like bool any_conflicts(__m256i).)


Finding presence/absence of any conflicts: bool any_conflicts32(__m256i)

8 choose 2 is 28 total scalar comparisons. That's 3.5 vectors of packed comparisons. We should aim to do it with 4 vector compares, which leaves room for some redundancy.

Creating inputs for those compares will require shuffles, and some of those will have to be lane-crossing. 4 unique comparisons require at least 4 vectors (including the initial unshuffled copy), since 3 choose 2 is only 3.

Ideally as few as possible of the shuffles are lane-crossing, and there is lots of ILP for the compares and ORing of compare results. Also nice if the shuffles don't need a vector shuffle-control, just an imm8. Also good if they're not slow on AMD Ryzen, where 256b instructions are decoded into multiple 128b uops. (Some shuffles are worse than others for this, e.g. vperm2i128 is very bad; much worse than vpermq for swapping the high and low halves of a single vector. Unfortunately clang gets this wrong even with -mtune=znver1, and compiles _mm256_permute4x64_epi64 into vperm2i128 whenever it can).

I found a solution pretty early that achieves most of these goals: 3 shuffles, 4 compares. One of the shuffles is in-lane. All of them use an immediate control byte instead of a vector.

#include <immintrin.h>
#include <stdbool.h>   // for the bool cast below when compiled as C

// returns a 0 or non-zero truth value
int any_conflicts32(__m256i v)
{
    __m256i hilo       = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));  // vpermq is much more efficient than vperm2i128 on Ryzen and KNL, same on HSW/SKL.
    __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
    __m256i full_rotl2 = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));

    __m256i v_ir1 = _mm256_cmpeq_epi32(v, inlane_rotr1);
    __m256i v_hilo= _mm256_cmpeq_epi32(v, hilo);           // only really needs to be a 128b operation on the low lane, with leaving the upper lane zero.
                                                           // But there's no ideal way to express that with intrinsics, since _mm256_castsi128_si256 technically leaves the high lane undefined
                                                           // It's extremely likely that casting down and back up would always compile to correct code, though (using the result in a zero-extended register).
    __m256i hilo_ir1 = _mm256_cmpeq_epi32(hilo, inlane_rotr1);
    __m256i v_fl2 = _mm256_cmpeq_epi32(v, full_rotl2);

    __m256i t1 = _mm256_or_si256(v_ir1, v_hilo);
    __m256i t2 = _mm256_or_si256(t1, v_fl2);
    __m256i conflicts = _mm256_or_si256(t2, hilo_ir1);    // A serial dep chain instead of a tree is probably good because of resource conflicts from limited shuffle throughput

    // if you're going to branch on this, movemask/test/jcc is more efficient than ptest/jcc

    unsigned conflict_bitmap = _mm256_movemask_epi8(conflicts);  // With these shuffles, positions in the bitmap aren't actually meaningful
    return (bool)conflict_bitmap;    // returning the raw bitmap would also work; only zero vs. non-zero means anything
}
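
If it helps, here's the kind of caller I have in mind: branch on the cheap whole-vector check and only fall back to scalar when there really is a duplicate index. This is a hypothetical sketch; `update8`, `table` and the bit-twiddling are placeholder names and placeholder work I made up, not anything from the question.

// placeholder usage sketch: gather / SIMD work / element-wise stores when the
// indices are unique, strictly-ordered scalar updates when they're not
void update8(unsigned *table, const unsigned idx[8])
{
    __m256i vidx = _mm256_loadu_si256((const __m256i *)idx);
    if (any_conflicts32(vidx)) {
        // hopefully-rare slow path: process in order so a later duplicate
        // index sees the earlier element's update
        for (int i = 0; i < 8; i++)
            table[idx[i]] |= 1u << i;                        // placeholder read-modify-write
    } else {
        // no duplicates: gather, do the SIMD work, then 8 scalar stores
        // (AVX2 has no scatter instruction)
        __m256i vals = _mm256_i32gather_epi32((const int *)table, vidx, 4);
        vals = _mm256_or_si256(vals, _mm256_setr_epi32(1,2,4,8,16,32,64,128));  // same placeholder work
        unsigned tmp[8];
        _mm256_storeu_si256((__m256i *)tmp, vals);
        for (int i = 0; i < 8; i++)
            table[idx[i]] = tmp[i];
    }
}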

How I designed this:

I made a table of all the element-pairs that needed to be checked, and made columns for which shuffled operands could take care of that requirement.

I started with a few shuffles that could be done cheaply, and it turned out my early guesses worked well enough.

My design notes:

    // 7 6 5 4 | 3 2 1 0

    // h g f e | d c b a
    // e h g f | a d c b    // inlanerotr1 = vpshufd(v)
    // f e d c | b a h g    // fullrotl2 = vpermq(v)

    // d c b a | h g f e    // hilo = vperm2i128(v) or vpermq.  v:hilo has lots of redundancy.  The low half has all the information.

 /*
           v:lrotr1     v:frotl2     lrotr1:frotl2                (incomplete)
 * ab   [0]v:lrotr1                 [3]lr1:fl2
 * ac                  [2]v:frotl2
 * ad   [3]v:lrotr1                 [2]lr1:fl2
 * ae                                                                           [0,4]v:hilo
 * af                                           [4]hilo:lrotr1
 * ag                  [0]v:frotl2
 * ah                                           [3]hilo:lrotr1

 * bc   [1]v:lrotr1
 * bd                  [3]v:frotl2                               [5]hilo:frotl2
 * be                                           [0]hilo:lrotr1
 * bf                                                                           [1,5]v:hilo
 * bg                               [0]lr1:fl2  [5]hilo:lrotr1
 * bh                  [1]v:frotl2

 * cd   [2]v:lrotr1
 * ce                  [4]v:frotl2  [4]lr1:fl2
 * cf                                           [1]hilo:lrotr1
 * cg                                                                           [2,6]v:hilo
 * ch                               [1]lr1:fl2  [6]hilo:lrotr1

 * de                                           [7]hilo:lrotr1
 * df                  [5]v:frotl2                               [7]hilo:frotl2
 * dg                               [5]lr1:fl2  [2]hilo:lrotr1
 * dh                                                                           [3,7]v:hilo

 * ef   [4]v:lrotr1                 [7]lr1:fl2
 * eg                  [6]v:frotl2
 * eh   [7]v:lrotr1                 [6]lr1:fl2

 * fg   [5]v:lrotr1
 * fh                  [7]v:frotl2

 * gh   [6]v:lrotr1

 */

It turns out that in-lane rotr1 == full rotl2 has a lot of redundancy, so it's not worth using. It also turns out that having all the allowed redundancy in v==hilo works fine.

If you care about which result is in which element (rather than just checking for presence/absence), then v == swap_hilo(lrotr1) could work instead of lrotr1 == hilo. But we also need swap_hilo(v), so this would mean an extra shuffle.

We could instead shuffle after hilo==lrotr1, for better ILP. Or maybe there's a different set of shuffles that gives us everything. Maybe if we consider VPERMD with a vector shuffle-control...
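
If you'd rather not stare at the table, a tiny brute-force checker confirms that the four chosen compares really do cover all 28 unordered pairs. This is throwaway verification code of mine, nothing from the actual kernel:

#include <stdio.h>

int main(void)
{
    // each shuffle written as an index permutation p, where result[i] = v[p[i]]
    int id[8]     = {0,1,2,3,4,5,6,7};   // v itself
    int lrotr1[8] = {1,2,3,0,5,6,7,4};   // _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1))
    int hilo[8]   = {4,5,6,7,0,1,2,3};   // _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2))
    int frotl2[8] = {6,7,0,1,2,3,4,5};   // _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3))

    int *cmps[4][2] = { {id, lrotr1}, {id, hilo}, {id, frotl2}, {hilo, lrotr1} };
    int covered[8][8] = {{0}};

    for (int c = 0; c < 4; c++)
        for (int i = 0; i < 8; i++) {            // lane i of compare c tests v[x] == v[y]
            int x = cmps[c][0][i], y = cmps[c][1][i];
            covered[x][y] = covered[y][x] = 1;
        }

    int missing = 0;
    for (int x = 0; x < 8; x++)
        for (int y = x + 1; y < 8; y++)
            if (!covered[x][y]) { printf("missing pair (%d,%d)\n", x, y); missing++; }
    printf("%d of 28 pairs missing\n", missing);
    return 0;
}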


Compiler asm output vs. optimal asm

gcc6.3 -O3 -march=haswell produces:

Haswell has one shuffle unit (on port5).

   # assume ymm0 ready on cycle 0
    vpermq  ymm2, ymm0, 78     # hilo ready on cycle 3 (execution started on cycle 0)
    vpshufd ymm3, ymm0, 57     # lrotr1 ready on cycle 2  (started on cycle 1)
    vpermq  ymm1, ymm0, 147    # frotl2 ready on cycle 5  (started on 2)
    vpcmpeqd  ymm4, ymm2, ymm0  # starts on 3, ready on 4
    vpcmpeqd  ymm1, ymm1, ymm0  # starts on 5, ready on 6
    vpcmpeqd  ymm2, ymm2, ymm3  # starts on 3, ready on 4
    vpcmpeqd  ymm0, ymm0, ymm3  # starts on 2, ready on 3
    vpor    ymm1, ymm1, ymm4    # starts on 6, ready on 7
    vpor    ymm0, ymm0, ymm2    # starts on 4, ready on 5
    vpor    ymm0, ymm1, ymm0    # starts on 7, ready on 8
         # a different ordering of VPOR merging could have saved a cycle here.  /scold gcc
    vpmovmskb       eax, ymm0
    vzeroupper
    ret

So the best-case latency is 8 cycles to have a single vector ready, given resource conflicts from other instructions in this sequence but assuming no conflicts with past instructions still in the pipeline. (It should have been 7 cycles, but gcc re-ordered the dependency structure of my intrinsics, putting more stuff dependent on the compare of the last shuffle result.)

This is faster than Skylake-AVX512's vpconflictd ymm, which has 17c latency, one per 10c throughput. (Of course, that gives you much more information, and @harold's emulation of it takes many more instructions).

Fortunately gcc didn't re-order the shuffles and introduce a potential write-back conflict. (e.g. putting the vpshufd last would mean that dispatching the shuffle uops to port5 in oldest-first order would have the vpshufd ready in the same cycle as the first vpermq (1c latency vs. 3c).) gcc did this for one version of the code (where I compared the wrong variable), so it seems that gcc -mtune=haswell doesn't take this into account. (Maybe it's not a big deal, I haven't measured to see what the real effect on latency is. I know the scheduler is smart about picking uops from the Reservation Station to avoid actual write-back conflicts, but IDK how smart it is, i.e. whether it would run the vpshufd ahead of a later vpermq to avoid a write-back conflict, since it would have to look-ahead to even see the upcoming writeback conflict. More likely it would just delay the vpshufd for an extra cycle before dispatching it.)

Anyway, this is why I put _mm256_shuffle_epi32 in the middle in the C source, where it makes things easy for OOO execution.

Clang 4.0 goes berserk and packs each compare result down to 128b vectors (with vextracti128 / vpacksswb), then expands back to 256b after three vpor xmm before pmovmskb. I thought at first it was doing this because of -mtune=znver1, but it does it with -mtune=haswell as well. It does this even if we return a bool, which would let it just pmovmskb / test on the packed vector. /facepalm. It also pessimizes the hilo shuffle to vperm2i128, even with -mtune=znver1 (Ryzen), where vperm2i128 is 8 uops but vpermq is 3. (Agner Fog's insn tables for some reason missed those, so I took those numbers from the FP equivalents vperm2f128 and vpermpd.)

@harold says that using add instead of or stops clang from packing/unpacking, but vpaddd has lower throughput than vpor on Intel pre-Skylake.
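
If you hit that clang behaviour, the workaround harold mentions is just swapping the three ORs for adds. The truth value survives because each 32-bit element sums at most four 0 / -1 compare results, which can't wrap back around to zero. A sketch against the variable names in the function above:

    // drop-in replacement for the three vpor lines, to dodge clang's
    // pack/unpack pessimization (per harold's comment)
    __m256i t1        = _mm256_add_epi32(v_ir1, v_hilo);    // each element: 0 .. -4
    __m256i t2        = _mm256_add_epi32(t1, v_fl2);
    __m256i conflicts = _mm256_add_epi32(t2, hilo_ir1);     // non-zero iff any compare hit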

Even better for Ryzen, the v == hilo compare can do only the low half. (i.e. use vpcmpeqd xmm2, xmm2, xmm3, which is only 1 uop instead of 2). We still need the full hilo for hilo == lrot1, though. So we can't just use vextracti128 xmm2, xmm0, 1 instead of the vpermq shuffle. vextracti128 has excellent performance on Ryzen: 1 uop, 1c latency, 0.33c throughput (can run on any of P0/1/3).

Since we're ORing everything together, it's fine to have zeros instead of redundant compare results in the high half.

As I noted in comments, IDK how to safely write this with intrinsics. The obvious way would be to use _mm256_castsi128_si256(_mm_cmpeq_epi32(v, hilo)), but that technically leaves the high lane undefined, rather than zero. There's no sane way a compiler would do anything other than use the full-width ymm register that contains the xmm register with the 128b compare result, but it would be legal according to Intel's docs for a Deathstation-9000 compiler to put garbage there. Any explicit way of getting zeros in the high half would depend on the compiler optimizing it away. Maybe _mm256_setr_m128i(cmpresult, _mm_setzero_si128());.
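
Spelled out, that last idea looks something like the sketch below. It assumes a compiler that provides _mm256_setr_m128i and that it will see through the explicit zero; there's no guarantee it won't emit a wasted vinserti128.

    // only compare the low 128 bits; explicitly zero the high half so the
    // later ORs still see all-zero ("no conflict") there
    __m128i lo_cmp = _mm_cmpeq_epi32(_mm256_castsi256_si128(v),
                                     _mm256_castsi256_si128(hilo));
    __m256i v_hilo = _mm256_setr_m128i(lo_cmp, _mm_setzero_si128());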


There are no current CPUs with AVX512F but not AVX512CD. But if that combo is interesting or relevant, clang makes some interesting asm from my code with -mavx512f -mavx512vl. It uses EVEX vpcmpeqd into mask registers, and korw to merge them. But then it expands that back into a vector to set up for vpmovmskb, instead of just optimizing away the movemask and using the korw result. /facepalm.
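
If you're targeting that combination on purpose, writing it with mask-compare intrinsics directly should sidestep that. A sketch using the same shuffles and header as above (assuming AVX512F + AVX512VL; I haven't checked what asm compilers actually emit for it):

int any_conflicts32_avx512vl(__m256i v)
{
    __m256i hilo         = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(1,0,3,2));
    __m256i inlane_rotr1 = _mm256_shuffle_epi32(v, _MM_SHUFFLE(0,3,2,1));
    __m256i full_rotl2   = _mm256_permute4x64_epi64(v, _MM_SHUFFLE(2,1,0,3));

    // same four compares, but producing __mmask8 results that can be ORed
    // as plain integers (hopefully kor in the asm, with no re-expansion)
    __mmask8 m = _mm256_cmpeq_epi32_mask(v, inlane_rotr1)
               | _mm256_cmpeq_epi32_mask(v, hilo)
               | _mm256_cmpeq_epi32_mask(hilo, inlane_rotr1)
               | _mm256_cmpeq_epi32_mask(v, full_rotl2);
    return m;   // 0 if no conflicts, non-zero otherwise
}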

Peter Cordes
  • I see you also ran into that Clang problem, using ADD instead of OR worked to get around that for me. Weird problem.. – harold Jul 01 '17 at 13:17
  • Thank you very much for this answer. I will validate it in the next days and then accept it! I basically gather a lot of values, then do some long bit hacking on them (which I'd like to do in parallel) and then want to write the values back to memory. If I write them back without checking for conflicts then I might overwrite changes that were made earlier. E.g. let's say element 0 in the vector modifies bits 0-15 and element 2 modifies bits 16-31; then the second store to memory will revert the changes to bits 0-15. – Christoph Diegelmann Jul 02 '17 at 15:40
  • @Christoph: ok, I think I can see the kind of thing you're describing. I guess you might want to branch on `any_conflicts(v)` before even doing the SIMD bit-manipulation then, depending on how you implement a scalar fallback. – Peter Cordes Jul 03 '17 at 01:23
  • I will try a few different things once I get it up and running and report back my findings in some form (any suggestions? Answer my own question?). Hopefully I don't have an average of like 1 or 2 conflicts. – Christoph Diegelmann Jul 03 '17 at 05:50
  • @Christoph: sure, if you want to write up some details of how you traded off extra fallback work vs. speed of detection, a self-answer would be a good place. Especially if you came up with any new tricks I didn't already suggest. If that looks like a lot of work, feel free to just leave a comment or something. :) – Peter Cordes Jul 03 '17 at 05:56
  • As long as I have the feeling it's worth it I'm gonna do it (and 8 upvotes on an AVX question is quite a nice indicator I guess), might take some time though. Maybe "hashing" down the 32 bit values into 16 bits and risking a few false positives might speed things up by not crossing lanes? – Christoph Diegelmann Jul 03 '17 at 05:59
  • @Christoph: If it turns out that conflicts are more common than you'd like, maybe try doing the detection in two 128-bit halves (if that lets you only do the fallback for one half). With different shuffles, you can do two in-lane conflict detections in parallel. You can probably share some of the shuffle+compare work between a full 8-wide conflict result and a pair of 4-wide in-lane conflict results, so you can get both for not much more cost than one or the other. – Peter Cordes Jul 03 '17 at 06:00
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/148187/discussion-between-christoph-and-peter-cordes). – Christoph Diegelmann Jul 03 '17 at 06:01
  • @Christoph: Lane-crossing shuffles have a bit of extra latency, but no throughput penalty (other than write-back conflicts from mixed-latency instruction sequences). There's probably enough independent work for each vector that out-of-order can hide the latency. – Peter Cordes Jul 03 '17 at 06:05
  • Found that chat transcript while looking for some of my old stuff with AVX512. FWIW, it's been confirmed that all the Skylake X SKUs have the port5 FMA - including the 6 core one. So you can easily build yourself a minimal system with full AVX512 for under $1000 to play with. /cc @Christoph – Mysticial Aug 01 '17 at 20:45
  • The Skylake Purley core has a "512-bit" mode that is enabled when any 512-bit instruction is in the reservation stations. In that mode, the port0/port1 vector units merge and port1 is shut off to all vector instructions. In 256-bit mode, the throughput for integer vectors is 3 x 256/cycle. In 512-bit mode, it's 2/cycle regardless of vector length. So 2 x 512-bit is the max. The port5 FMA has higher latency than the port0/1 FMA. – Mysticial Aug 01 '17 at 20:49
  • The biggest bottleneck I've found is the L3 cache. It has 2-3x less bandwidth than Haswell-E, which makes it barely faster than RAM. Pretend it doesn't exist if you're tuning for cache sizes. The L1 and L2 are fine, as they've doubled in bandwidth from Haswell/Broadwell-E. – Mysticial Aug 01 '17 at 20:49
  • Ah, p1 shutdown explains why AVX512 isn't as big a speedup as expected in something I vectorized a few months ago. – Peter Cordes Aug 01 '17 at 20:52
  • @Mysticial: So L2 bandwidth is 64B per clock on SKL-X? Is that only when using ZMM loads? Because otherwise L1 and L2 bandwidth would be about the same for YMM loads, and that's not what I remember finding. – Peter Cordes Aug 01 '17 at 20:53
  • @PeterCordes I'm not sure on the specifics. I'm deriving my findings based on the AIDA64 numbers as well as high-level benchmarks of my own code. The crappiness of the L3 explains why my code starts slowing down drastically long before it spills into memory. – Mysticial Aug 01 '17 at 20:55
  • L3 is a lot slower than L2 in many-core Haswell, too. For example, STREAM benchmarks on google-cloud VM Haswell (I think dual socket 18 core) and Skylake-X (I think dual socket 28 core) showed somewhat lower single-core L3 bandwidth on Skylake than Haswell, but both suck compared to my desktop quad-core Skylake's L3 BW (even accounting for differences in clock speed). Many-core parts have higher uncore latency, and single-core bandwidth is limited by how much concurrency it can keep in flight. (See the latency-bound part of https://stackoverflow.com/a/43574756/224132) – Peter Cordes Aug 01 '17 at 21:23
  • Found [a single-threaded SiSoft Sandra benchmark here](http://techreport.com/review/32111/intel-core-i9-7900x-cpu-reviewed-part-one/4), showing i9-7900X bandwidth falls off a cliff after 1MB, with single-threaded L3 bandwidth only a tiny bit faster than RAM, unlike BDW-E i7-6950X or HSW-E. – Peter Cordes Aug 01 '17 at 21:47
  • Looks like L2 bandwidth is half L1 in that AVX2 test (update: or is it using AVX512, too? That would explain it beating a presumably higher-clocked i7-7700k. Stupid graphs using seconds instead of core clocks.). [An AVX512 test](http://www.sisoftware.eu/2017/06/23/intel-core-i9-skl-x-review-and-benchmarks-cpu-avx512-is-here/) found L2 bandwidth was still nearly half L1D with AVX512, so I guess you need AVX512 to get full L2 bandwidth on SKL-X, unless both tests were actually using AVX512. – Peter Cordes Aug 01 '17 at 21:53
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/150766/discussion-between-mysticial-and-peter-cordes). – Mysticial Aug 01 '17 at 21:59
  • +1 for (among other things) excellent "bounded" analysis of the optimal asm, starting with an information theoretic (more or less) bound on the number of comparisons and following through with asm that achieves it. – BeeOnRope Aug 01 '17 at 23:53