Analyzing results of comparison in AVX2

Question

I'm trying to optimize code with AVX2 assembly. At one point I need to compare the result of a computation with a threshold and write a 0 or 255 byte to the output. I compare with:

VCMPPD ymm2,ymm1 (values here),ymm4 (thresholds here),1

Then ymm2 holds 4 quadwords, each either 0 or all-ones (0xFFFFFFFFFFFFFFFF). Ideally I would shrink this down to 4 bytes in EAX. Right now I'm doing 4 VPTEST operations and several conditional jumps to form the output, which slows things down significantly.

Question: how can I get and use the result of a comparison efficiently with AVX2?

Is `VPMOVMSKB eax, ymm2` what you're looking for? – Iwillnotexist Idonotexist Nov 09 '20 at 04:20

score 4 · Answer 1 · answered Nov 09 '20 at 04:23

You're probably looking for `vmovmskpd eax, ymm2` (manual entry) to get a 4-bit bitmap, which you can then analyze with integer instructions: `test eax, eax` to check whether any element was true, `cmp al, 0xf` to check whether all elements were true, or even use it as an index for a jump table like `jmp [table + rax*8]` if you need finer detail of which elements were true.

You could of course use `vpmovmskb` if you actually want 8 identical bits from each vector element, one from each byte.
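
For readers working in C rather than assembly, here is a minimal sketch of the same two instructions via intrinsics. The helper names `lt_mask4` and `lt_bytemask4` are hypothetical, and the code assumes GCC or Clang (for the `target` attribute) and a CPU that supports AVX2. Predicate `_CMP_LT_OS` has the value 1, matching the immediate used in the question.

```c
#include <immintrin.h>

/* Hypothetical helper: compare 4 doubles against 4 thresholds with
   predicate 1 (less-than) and return one bit per element in bits 0..3. */
__attribute__((target("avx2")))
int lt_mask4(const double *values, const double *thresholds)
{
    __m256d v   = _mm256_loadu_pd(values);
    __m256d t   = _mm256_loadu_pd(thresholds);
    __m256d cmp = _mm256_cmp_pd(v, t, _CMP_LT_OS);  /* vcmppd ..., 1 */
    return _mm256_movemask_pd(cmp);                  /* vmovmskpd */
}

/* Hypothetical helper: same compare, but one bit per byte (vpmovmskb),
   so each 8-byte element contributes 8 identical bits to the result. */
__attribute__((target("avx2")))
unsigned lt_bytemask4(const double *values, const double *thresholds)
{
    __m256d v   = _mm256_loadu_pd(values);
    __m256d t   = _mm256_loadu_pd(thresholds);
    __m256d cmp = _mm256_cmp_pd(v, t, _CMP_LT_OS);
    return (unsigned)_mm256_movemask_epi8(_mm256_castpd_si256(cmp));
}
```

With the 4-bit form, `mask != 0` corresponds to the `test eax, eax` check (any element true) and `mask == 0xf` to the `cmp al, 0xf` check (all elements true).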

If you didn't already know about `movmskps`/`pd` and so on, I'd suggest Agner Fog's optimization guide (https://agner.org/optimize/); it has a chapter about SIMD. Vector -> bitmap is one of x86's best features.

Peter Cordes

328,167
45
605
847

While this is certainly correct, often the thing the SIMD programmer should really be asking is why the data needs to be moved back to the condition registers / scalar domain in the first place. If possible, it may be preferable to use the vector compare results as-is in a branch-free vector algorithm downstream and skip the branchiness altogether. However, this would mean asking the age-old question "What are you really trying to do?", which 3 years after the fact seems unlikely to bear fruit. – Ian Ollmann Aug 31 '23 at 07:00
@IanOllmann: Yeah, looking at this again, it's not clear what they're really trying to do. Maybe they can just pack 8 vectors of qword compare results down to one vector of bytes using `vpacksswb`, which preserves the high bit of each word (and keeps them as 0 / -1). (And an eventual byte-shuffle or something to rearrange, since in-lane pack instructions work on the 128-bit halves separately.) – Peter Cordes Aug 31 '23 at 07:10
Or at least a scalar lookup table of data based on the 4-bit mask, such as a shuffle-control vector for `vpermpd` or something, if left-packing the original elements. (See [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) - 4 elements is few enough for a LUT of shuffle vectors to be good, unlike with 8 elements in a `__m256`.) IDK why I only mentioned indexing a jump table when I wrote this answer! – Peter Cordes Aug 31 '23 at 07:12
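
As an illustration of the scalar lookup-table idea in the comment above, applied to the question's stated goal of writing a 0 or 255 byte per element, here is a minimal sketch. The table `mask_to_bytes` and the function `threshold4_to_bytes` are hypothetical names, and the code assumes GCC or Clang, a CPU with AVX2, and x86's little-endian byte order (so bit i of the mask maps to output byte i).

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: entry m has byte i equal to 0xFF when bit i of m is set. */
static const uint32_t mask_to_bytes[16] = {
    0x00000000u, 0x000000FFu, 0x0000FF00u, 0x0000FFFFu,
    0x00FF0000u, 0x00FF00FFu, 0x00FFFF00u, 0x00FFFFFFu,
    0xFF000000u, 0xFF0000FFu, 0xFF00FF00u, 0xFF00FFFFu,
    0xFFFF0000u, 0xFFFF00FFu, 0xFFFFFF00u, 0xFFFFFFFFu
};

/* Write 4 bytes to out: 255 where values[i] < thresholds[i], else 0.
   No branches: the 4-bit compare mask indexes the table directly. */
__attribute__((target("avx2")))
void threshold4_to_bytes(const double *values, const double *thresholds,
                         uint8_t *out)
{
    __m256d v   = _mm256_loadu_pd(values);
    __m256d t   = _mm256_loadu_pd(thresholds);
    __m256d cmp = _mm256_cmp_pd(v, t, _CMP_LT_OS);  /* vcmppd ..., 1 */
    int mask    = _mm256_movemask_pd(cmp);           /* vmovmskpd: 0..15 */
    uint32_t bytes = mask_to_bytes[mask];
    memcpy(out, &bytes, sizeof bytes);               /* single 4-byte store */
}
```

This replaces the four `VPTEST` operations and conditional jumps with one table load and one store. When many vectors are processed in a loop, the pack-based approach mentioned in the earlier comment may be a better fit.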
