See also some AVX512 histogram links and info I dug up a while ago in this answer.
I think the basic idea is to scatter the conflict-free set of elements, then re-gather, re-process, and re-scatter the next conflict-free set of elements. Repeat until there are no more conflicts.
Note that the first appearance of a repeated index is a "conflict-free" element according to vpconflictd, so a simple repeat loop makes forward progress.
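To make the vpconflictd semantics concrete, here's a minimal standalone demo (AVX512F + AVX512CD; compile with e.g. -march=skylake-avx512):

#include <immintrin.h>
#include <stdio.h>

// Each output element of vpconflictd is a bitmap of LOWER-numbered lanes that
// hold the same index.  First occurrences get 0, so every vector has at least
// one "conflict-free" lane per distinct index value.
int main(void) {
    __m512i idx  = _mm512_setr_epi32(3,1,4,1,5,9,2,6, 5,3,5,8,9,7,9,3);
    __m512i conf = _mm512_conflict_epi32(idx);

    int vals[16]; unsigned bits[16];
    _mm512_storeu_si512(vals, idx);
    _mm512_storeu_si512(bits, conf);
    for (int n = 0; n < 16; n++)
        printf("lane %2d: idx=%d conflicts=0x%04x\n", n, vals[n], bits[n]);
    // lane 1 (first '1') prints 0x0000, but lane 3 (second '1') prints 0x0002:
    // it conflicts only with lane 1.  testn against all-ones turns the all-zero
    // rows into set bits in the no-conflict mask.
    return 0;
}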
Steps in this process:
Turn a vpconflictd result into a mask that you can use with a masked gather or scatter: _mm512_testn_epi32_mask (as suggested by @veritas) against a vector of all-ones looks good for this, since you need the inverted (no-conflict) sense. Plain test (vptestmd) against itself would give you the conflict mask instead, which you'd then have to invert separately.
Remove the already-done elements: vpcompressd is probably good for this. We can even fill up the "empty" spaces in our vector with new elements, so we don't re-run the gather / process / scatter loop with most of the elements masked. (See the short compress/merge demo after this list.)
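The merge-masking compress is the subtle part, so here's a tiny standalone demo of how _mm512_mask_compress_epi32 packs the leftover (conflicting) indices into the bottom lanes and takes the rest from a newly-loaded vector. The lane values are made up just to show the layout:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // Pretend lanes 2 and 5 conflicted (bits CLEAR in knoconflict): 14 lanes done.
    __m512i indices = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i newidx  = _mm512_setr_epi32(100,101,102,103,104,105,106,107,
                                        108,109,110,111,112,113,114,115);
    __mmask16 knoconflict = (__mmask16)~((1u << 2) | (1u << 5));   // 0xFFDB

    // Leftovers (lanes 2 and 5 of `indices`) are packed into lanes 0-1;
    // lanes 2..15 pass through from `newidx`.
    __m512i merged = _mm512_mask_compress_epi32(newidx, (__mmask16)~knoconflict, indices);

    int out[16];
    _mm512_storeu_si512(out, merged);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    // prints: 2 5 102 103 104 105 106 107 108 109 110 111 112 113 114 115
    return 0;
}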
For example, this might work as a histogram loop, if I'm doing this right:
// probably slow, since it assumes conflicts and has a long loop-carried dep chain
// TOTALLY untested.  (Cleanup of the final `indices` vector and the tail is omitted.)
__m512i all_ones = _mm512_set1_epi32(-1); // easy to gen on the fly (vpternlogd)
__m512i indices = _mm512_loadu_si512(p);
// p stays at the first not-yet-consumed element: the refill load below
// deliberately re-reads elements that are still sitting in `indices`.

// pessimistic loop that assumes conflicts
while (p + 32 <= endp) {   // leave room for the refill load even if done == 16
    // unmasked gather, so it can run in parallel with conflict detection
    __m512i v = _mm512_i32gather_epi32(indices, base, 4);
    v = _mm512_sub_epi32(v, all_ones); // -= -1 to reuse the constant.

    // scatter the no-conflict elements
    __m512i conflicts = _mm512_conflict_epi32(indices);
    __mmask16 knoconflict = _mm512_testn_epi32_mask(conflicts, all_ones);
    _mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);
    // if(knoconflict == 0xffff) { goto optimistic_loop; }

    // keep the conflicting elements and merge in new indices to refill the vector
    size_t done = _popcnt32(knoconflict);
    p += done; // the elements that overlap will be replaced with the conflicts from last time
    __m512i newidx = _mm512_loadu_si512(p);
    // merge-mask the leftovers into the bottom of the newly-loaded index vector
    indices = _mm512_mask_compress_epi32(newidx, ~knoconflict, indices);
}
We end up needing the mask both ways (knoconflict and ~knoconflict). It might be best to use _mm512_test_epi32_mask(same,same) and avoid the need for a vector constant to testn against. That might shorten the loop-carried dependency chain from indices in mask_compress, by putting the inversion of the mask onto the scatter dependency chain. When there are no conflicts (including between iterations), the scatter is independent.
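For example, that variant of the loop could look like this (same caveats as above: totally untested, tail cleanup omitted):

#include <immintrin.h>

// vptestmd (test, not testn) of the conflict result against itself yields the
// *conflict* mask directly, which is what mask_compress on the loop-carried
// `indices` chain needs; the knot to make the scatter mask runs off that path.
static void histo_testm(int *base, const int *p, const int *endp)
{
    const __m512i all_ones = _mm512_set1_epi32(-1);
    __m512i indices = _mm512_loadu_si512(p);

    while (p + 32 <= endp) {
        __m512i v = _mm512_i32gather_epi32(indices, base, 4);
        v = _mm512_sub_epi32(v, all_ones);                    // v += 1

        __m512i conflicts = _mm512_conflict_epi32(indices);
        __mmask16 kconflict   = _mm512_test_epi32_mask(conflicts, conflicts); // no all-ones vector needed
        __mmask16 knoconflict = _mm512_knot(kconflict);       // only the scatter waits for this
        _mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);

        p += 16 - _popcnt32(kconflict);                       // elements finished this round
        __m512i newidx = _mm512_loadu_si512(p);
        indices = _mm512_mask_compress_epi32(newidx, kconflict, indices); // un-inverted mask
    }
    // cleanup of the final `indices` vector and the tail omitted, as above
}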
If conflicts are rare, it's probably better to branch on it. This branchless handling of conflicts is a bit like using cmov in a loop: it creates a long loop-carried dependency chain. Branch prediction + speculative execution would break those chains, and allow multiple gathers / scatters to be in flight at once. (And avoid running popcnt / vpcompressd at all when there are no conflicts.)
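A branchy version might be structured like this (untested sketch; the scalar fixup for repeated indices is just one of several possible slow paths):

#include <immintrin.h>

// Optimistic: assume no conflicts, and only fall into fixup code when
// vpconflictd finds some.  When the no-conflict branch predicts well, several
// iterations' gathers/scatters overlap, and popcnt/vpcompressd never run.
static void histo_branchy(int *base, const int *p, const int *endp)
{
    const __m512i all_ones = _mm512_set1_epi32(-1);
    for (; p + 16 <= endp; p += 16) {         // tail handling omitted
        __m512i indices   = _mm512_loadu_si512(p);
        __m512i v         = _mm512_i32gather_epi32(indices, base, 4); // runs in parallel
        __m512i conflicts = _mm512_conflict_epi32(indices);           //  with the detection
        v = _mm512_sub_epi32(v, all_ones);                            // v += 1

        __mmask16 kconflict = _mm512_test_epi32_mask(conflicts, conflicts);
        if (kconflict == 0) {                 // hopefully the common, well-predicted case
            _mm512_i32scatter_epi32(base, indices, v, 4);
        } else {
            // Rare path: scatter the first occurrence of each index (those
            // lanes are all distinct), then bump the repeats with scalar adds.
            _mm512_mask_i32scatter_epi32(base, _mm512_knot(kconflict), indices, v, 4);
            int idx[16];
            _mm512_storeu_si512(idx, indices);
            for (unsigned left = kconflict; left; left &= left - 1)
                base[idx[_tzcnt_u32(left)]] += 1;
        }
    }
}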
Also note that vpconflictd is slow-ish on Skylake-avx512 (but not on KNL). When you expect conflicts to be very rare, you might even use a fast any_conflicts() check that doesn't find out where they are before running the conflict-handling.
See Fallback implementation for conflict detection in AVX2 for a ymm AVX2 implementation, which should be faster than Skylake-AVX512's micro-coded vpconflictd ymm. Expanding it to 512b zmm vectors shouldn't be difficult (and might be even more efficient if you can take advantage of AVX512 masked compare-into-mask to replace a boolean operation between two compare results). Maybe with AVX512 vpcmpud k0{k1}, zmm0, zmm1 with a NEQ predicate.
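For example, an any_conflicts() for one zmm vector of dword indices could compare against rotations of itself, chaining masked NEQ compares so no separate kand/kor is needed (untested sketch):

#include <immintrin.h>

// Answer "are any two lanes equal?" without materializing vpconflictd's
// per-lane bitmaps.  A pair of equal lanes at distance d is seen by rotation d
// (or 16-d), so rotations 1..8 cover all pairs.  Zeroing-masked NEQ compares
// (vpcmpud k{k1}) fold the AND of successive results into the compare itself.
static inline int any_conflicts(__m512i idx)
{
    __mmask16 k = _mm512_cmpneq_epu32_mask(idx, _mm512_alignr_epi32(idx, idx, 1));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 2));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 3));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 4));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 5));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 6));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 7));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 8));
    return k != 0xFFFF;        // some lane matched another lane somewhere
}

One caveat: chaining through the mask serializes the eight compares; if this check sits on the critical path, unmasked compares combined with a kor tree have more ILP at the cost of extra instructions.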