See also some AVX512 histogram links and info I dug up a while ago in this answer.
I think the basic idea is to scatter the conflict-free set of elements, then re-gather, re-process, and re-scatter the next conflict-free set of elements. Repeat until there are no more conflicts.
Note that the first appearance of a repeated index is a "conflict-free" element according to vpconflictd, so a simple repeat loop makes forward progress.
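To make the vpconflictd semantics concrete, here's a minimal standalone demo (AVX512F + AVX512CD; compile with e.g. -march=skylake-avx512):

#include <immintrin.h>
#include <stdio.h>

// Each output element of vpconflictd is a bitmap of LOWER-numbered lanes that
// hold the same index.  First occurrences get 0, so every vector has at least
// one "conflict-free" lane per distinct index value.
int main(void) {
    __m512i idx  = _mm512_setr_epi32(3,1,4,1,5,9,2,6, 5,3,5,8,9,7,9,3);
    __m512i conf = _mm512_conflict_epi32(idx);

    int vals[16]; unsigned bits[16];
    _mm512_storeu_si512(vals, idx);
    _mm512_storeu_si512(bits, conf);
    for (int n = 0; n < 16; n++)
        printf("lane %2d: idx=%d conflicts=0x%04x\n", n, vals[n], bits[n]);
    // lane 1 (first '1') prints 0x0000, but lane 3 (second '1') prints 0x0002:
    // it conflicts only with lane 1.  testn against all-ones turns the all-zero
    // rows into set bits in the no-conflict mask.
    return 0;
}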
Steps in this process:
Turn a vpconflictd result into a mask that you can use with a masked gather or scatter: _mm512_testn_epi32_mask (as suggested by @veritas) against a vector of all-ones looks good for this, since you need the inverted (no-conflict) sense. Plain test (vptestmd) against itself would give you the conflict mask instead, which you'd then have to invert separately.
Remove the already-done elements: vpcompressd is probably good for this. We can even fill up the "empty" spaces in our vector with new elements, so we don't re-run the gather / process / scatter loop with most of the elements masked. (See the short compress/merge demo after this list.)
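The merge-masking compress is the subtle part, so here's a tiny standalone demo of how _mm512_mask_compress_epi32 packs the leftover (conflicting) indices into the bottom lanes and takes the rest from a newly-loaded vector. The lane values are made up just to show the layout:

#include <immintrin.h>
#include <stdio.h>

int main(void) {
    // Pretend lanes 2 and 5 conflicted (bits CLEAR in knoconflict): 14 lanes done.
    __m512i indices = _mm512_setr_epi32(0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15);
    __m512i newidx  = _mm512_setr_epi32(100,101,102,103,104,105,106,107,
                                        108,109,110,111,112,113,114,115);
    __mmask16 knoconflict = (__mmask16)~((1u << 2) | (1u << 5));   // 0xFFDB

    // Leftovers (lanes 2 and 5 of `indices`) are packed into lanes 0-1;
    // lanes 2..15 pass through from `newidx`.
    __m512i merged = _mm512_mask_compress_epi32(newidx, (__mmask16)~knoconflict, indices);

    int out[16];
    _mm512_storeu_si512(out, merged);
    for (int i = 0; i < 16; i++) printf("%d ", out[i]);
    // prints: 2 5 102 103 104 105 106 107 108 109 110 111 112 113 114 115
    return 0;
}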
For example, this might work as a histogram loop, if I'm doing this right:
// probably slow, since it assumes conflicts and has a long loop-carried dep chain
// TOTALLY untested.  (Cleanup of the final `indices` vector and the tail is omitted.)
__m512i all_ones = _mm512_set1_epi32(-1); // easy to gen on the fly (vpternlogd)
__m512i indices = _mm512_loadu_si512(p);
// p stays at the first not-yet-consumed element: the refill load below
// deliberately re-reads elements that are still sitting in `indices`.

// pessimistic loop that assumes conflicts
while (p + 32 <= endp) {   // leave room for the refill load even if done == 16
    // unmasked gather, so it can run in parallel with conflict detection
    __m512i v = _mm512_i32gather_epi32(indices, base, 4);
    v = _mm512_sub_epi32(v, all_ones); // -= -1 to reuse the constant.

    // scatter the no-conflict elements
    __m512i conflicts = _mm512_conflict_epi32(indices);
    __mmask16 knoconflict = _mm512_testn_epi32_mask(conflicts, all_ones);
    _mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);
    // if(knoconflict == 0xffff) { goto optimistic_loop; }

    // keep the conflicting elements and merge in new indices to refill the vector
    size_t done = _popcnt32(knoconflict);
    p += done; // the elements that overlap will be replaced with the conflicts from last time
    __m512i newidx = _mm512_loadu_si512(p);
    // merge-mask the leftovers into the bottom of the newly-loaded index vector
    indices = _mm512_mask_compress_epi32(newidx, ~knoconflict, indices);
}
We end up needing the mask both ways (knoconflict and ~knoconflict). It might be best to use _mm512_test_epi32_mask(same,same) and avoid the need for a vector constant to testn against. That might shorten the loop-carried dependency chain from indices in mask_compress, by putting the inversion of the mask onto the scatter dependency chain. When there are no conflicts (including between iterations), the scatter is independent.
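For example, that variant of the loop could look like this (same caveats as above: totally untested, tail cleanup omitted):

#include <immintrin.h>

// vptestmd (test, not testn) of the conflict result against itself yields the
// *conflict* mask directly, which is what mask_compress on the loop-carried
// `indices` chain needs; the knot to make the scatter mask runs off that path.
static void histo_testm(int *base, const int *p, const int *endp)
{
    const __m512i all_ones = _mm512_set1_epi32(-1);
    __m512i indices = _mm512_loadu_si512(p);

    while (p + 32 <= endp) {
        __m512i v = _mm512_i32gather_epi32(indices, base, 4);
        v = _mm512_sub_epi32(v, all_ones);                    // v += 1

        __m512i conflicts = _mm512_conflict_epi32(indices);
        __mmask16 kconflict   = _mm512_test_epi32_mask(conflicts, conflicts); // no all-ones vector needed
        __mmask16 knoconflict = _mm512_knot(kconflict);       // only the scatter waits for this
        _mm512_mask_i32scatter_epi32(base, knoconflict, indices, v, 4);

        p += 16 - _popcnt32(kconflict);                       // elements finished this round
        __m512i newidx = _mm512_loadu_si512(p);
        indices = _mm512_mask_compress_epi32(newidx, kconflict, indices); // un-inverted mask
    }
    // cleanup of the final `indices` vector and the tail omitted, as above
}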
If conflicts are rare, it's probably better to branch on it. This branchless handling of conflicts is a bit like using cmov in a loop: it creates a long loop-carried dependency chain. Branch prediction + speculative execution would break those chains, and allow multiple gathers / scatters to be in flight at once. (And avoid running popcnt / vpcompressd at all when there are no conflicts.)
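A branchy version might be structured like this (untested sketch; the scalar fixup for repeated indices is just one of several possible slow paths):

#include <immintrin.h>

// Optimistic: assume no conflicts, and only fall into fixup code when
// vpconflictd finds some.  When the no-conflict branch predicts well, several
// iterations' gathers/scatters overlap, and popcnt/vpcompressd never run.
static void histo_branchy(int *base, const int *p, const int *endp)
{
    const __m512i all_ones = _mm512_set1_epi32(-1);
    for (; p + 16 <= endp; p += 16) {         // tail handling omitted
        __m512i indices   = _mm512_loadu_si512(p);
        __m512i v         = _mm512_i32gather_epi32(indices, base, 4); // runs in parallel
        __m512i conflicts = _mm512_conflict_epi32(indices);           //  with the detection
        v = _mm512_sub_epi32(v, all_ones);                            // v += 1

        __mmask16 kconflict = _mm512_test_epi32_mask(conflicts, conflicts);
        if (kconflict == 0) {                 // hopefully the common, well-predicted case
            _mm512_i32scatter_epi32(base, indices, v, 4);
        } else {
            // Rare path: scatter the first occurrence of each index (those
            // lanes are all distinct), then bump the repeats with scalar adds.
            _mm512_mask_i32scatter_epi32(base, _mm512_knot(kconflict), indices, v, 4);
            int idx[16];
            _mm512_storeu_si512(idx, indices);
            for (unsigned left = kconflict; left; left &= left - 1)
                base[idx[_tzcnt_u32(left)]] += 1;
        }
    }
}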
Also note that vpconflictd is slow-ish on Skylake-avx512 (but not on KNL). When you expect conflicts to be very rare, you might even use a fast any_conflicts() check that doesn't find out where they are before running the conflict-handling.
See Fallback implementation for conflict detection in AVX2 for a ymm AVX2 implementation, which should be faster than Skylake-AVX512's micro-coded vpconflictd ymm. Expanding it to 512b zmm vectors shouldn't be difficult (and might be even more efficient if you can take advantage of AVX512 masked compare-into-mask to replace a boolean operation between two compare results). Maybe with AVX512 vpcmpud k0{k1}, zmm0, zmm1 with a NEQ predicate.
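For example, an any_conflicts() for one zmm vector of dword indices could compare against rotations of itself, chaining masked NEQ compares so no separate kand/kor is needed (untested sketch):

#include <immintrin.h>

// Answer "are any two lanes equal?" without materializing vpconflictd's
// per-lane bitmaps.  A pair of equal lanes at distance d is seen by rotation d
// (or 16-d), so rotations 1..8 cover all pairs.  Zeroing-masked NEQ compares
// (vpcmpud k{k1}) fold the AND of successive results into the compare itself.
static inline int any_conflicts(__m512i idx)
{
    __mmask16 k = _mm512_cmpneq_epu32_mask(idx, _mm512_alignr_epi32(idx, idx, 1));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 2));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 3));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 4));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 5));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 6));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 7));
    k = _mm512_mask_cmpneq_epu32_mask(k, idx, _mm512_alignr_epi32(idx, idx, 8));
    return k != 0xFFFF;        // some lane matched another lane somewhere
}

One caveat: chaining through the mask serializes the eight compares; if this check sits on the critical path, unmasked compares combined with a kor tree have more ILP at the cost of extra instructions.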