How do the Conflict Detection instructions make it easier to vectorize loops?

Question

The AVX512CD instruction families are: VPCONFLICT, VPLZCNT and VPBROADCASTM.

The Wikipedia section about these instruction says:

The instructions in AVX-512 conflict detection (AVX-512CD) are designed to help efficiently calculate conflict-free subsets of elements in loops that normally could not be safely vectorized.

What are some examples that show these instruction being useful in vectorizing loops? It would be helpful if answers will include scalar loops and their vectorized counterparts.

Thanks!

[What Is AVX-512 and Why Is Intel Killing It Off?](https://www.makeuseof.com/what-is-avx-512-why-intel-killing-it/), [Linus Torvalds: 'I Hope AVX512 Dies a Painful Death'](https://www.extremetech.com/computing/312673-linus-torvalds-i-hope-avx512-dies-a-painful-death). [Conflict Detection-based Run-Length Encoding — AVX-512 CD Instruction Set in Action](https://hardbd-active.github.io/2018/papers/UngerthumPDHL-hardbd-active-18.pdf). — Richard Chambers, Aug 07 '23 at 12:16

Paul R · Accepted Answer · 2016-10-07T11:51:08.680

15

One example where the CD instructions might be useful is histogramming. For scalar code histogramming is just a simple loop like this:

load bin index
load bin count at index
increment bin count
store updated bin count at index

Normally you can't vectorize histogramming because you might have the same bin index more than once in a vector - you might naïvely try something like this:

load vector of N bin indices
perform gathered load using N bin indices to get N bin counts
increment N bin counts
store N updated bin counts using scattered store

but if any of the indices within a vector are the same then you get a conflict, and the resulting bin update will be incorrect.

So, CD instructions to the rescue:

load vector of N bin indices
use CD instruction to test for duplicate indices
set mask for all unique indices
while mask not empty
    perform masked gathered load using <N bin indices to get <N bin counts
    increment <N bin counts
    store <N updated bin counts using masked scattered store
    remove non-masked indices and update mask
end

In practice this example is quite inefficient and no better than scalar code, but there are other more compute-intensive examples where using the CD instructions seems to be worthwhile. Typically these will be simulations where the data elements are going to be updated in a non-deterministic fashion. One example (from the LAMMPS Molecular Dynamics Simulator) is referred to in the KNL book by Jeffers et al.

edited Oct 07 '16 at 11:51

answered Oct 07 '16 at 11:39

Paul R

208,748
37
389
560

2

See also [page 50 of Kirill Yukhin's slides from 2014](https://gcc.gnu.org/wiki/cauldron2014?action=AttachFile&do=get&target=Cauldron14_AVX-512_Vector_ISA_Kirill_Yukhin_20140711.pdf#page=50). I posted some stuff about speeding up histogram-type problems with/without AVX512CD [on a recent question](https://stackoverflow.com/questions/39266476/how-to-speed-up-this-histogram-of-lut-lookups). The usual trick of cloning your bins and totalling at the end can help avoid conflicts (and store-forwarding data dependency bottlenecks even for scalar). – Peter Cordes Oct 07 '16 at 16:26
Thanks Peter - I hadn't seen those slides before - some interesting tidbits in there... – Paul R Oct 07 '16 at 16:35
1

Section "16.2.3 Using AVX-512CD Instructions" of the [Intel optimization manuals](http://www.intel.com/content/www/us/en/architecture-and-technology/64-ia-32-architectures-optimization-manual.html) describes this exactly. See example "Example 16-4. Vectorized Histogram Update Using AVX-512CD". Is that where you got this idea? How do you know it's not really better than scalar? Have you tried it in practice with KNL? – Z boson Dec 27 '16 at 18:06
@Zboson: No, I don't think I'd seen that (I think my main reference was the Jeffers book linked to above), but I did benchmark some of this stuff on KNL a few months back and it didn't seem to be too impressive. I ran out of time though, so may revisit it when I get a chance. – Paul R Dec 27 '16 at 20:25
1

Maybe you can put your code here? I am thinking of trying to implement this with KNL. When did you get access to KNL? – Z boson Dec 27 '16 at 20:52
@Zboson: it's work stuff unfortunately, so I can't share code, but it's not too different than the examples referenced above. I have access to a KNL system at work, but very little time to experiment currently. – Paul R Dec 27 '16 at 20:54
So I just started implementing my own version with AVX512 and I just realized I don't know what happens when you scatter to the same address. I mean if and of the values in `vindex` are identical. With OpenMP this would create a race condition. – Z boson Dec 30 '16 at 14:35
A little test shows that it uses the value of the last element just like you would expect if you wrote to the same address sequentially. E.g. if you had an array `a = {0,1,2,3}` and you loaded that into an SSE register `va = _mm_load_epi32(a)` and then scattered that to the same address the address would contain 3. – Z boson Dec 30 '16 at 14:55
I'm not sure I understand - the point of the CD instructions is that you mask out any stores to identical indices so that there is no conflict. Or are you trying something different ? – Paul R Dec 30 '16 at 15:05
@PaulR, I am asking a general question. E.g what if you did not have the CD instructions (or you just did not use them). What can you expect if you scatter to the same address? Is this defined behavior. E.g. the scatter is broken into multiple micro-ops. Is there a guarantee which order they will write? – Z boson Jan 02 '17 at 09:46
@Zboson: ah - OK - my gut tells me this is probably going to be undefined behaviour - you can probably see what happens for a given CPU but there will be no guarantees that it will always work. – Paul R Jan 02 '17 at 09:53
What about general writes to the same address with OoO? If I write to two different addresses I assume the OoO engine can reorder them however it thinks best. But writing to the same address should be done in order I think? What about writing two difference addresses in the same cache line? Is it even possible to write to two different addresses at the same time with x86? My x86 knowledge is rusty so I'm just try to catch up. Do you think this is worth a SO question? – Z boson Jan 02 '17 at 10:05
Sure - it should make an interesting question. – Paul R Jan 02 '17 at 13:08

How do the Conflict Detection instructions make it easier to vectorize loops?

1 Answers1

Linked

Related