
I'm trying to optimize code with AVX2 assembly. At one point I need to compare the result of a computation with a threshold and write a 0 or 255 byte to the output. I compare with

vcmppd ymm2, ymm1, ymm4, 1    ; ymm1 = values, ymm4 = thresholds

Then ymm2 holds 4 quadwords, each either 0 or all ones (0xFFFFFFFFFFFFFFFF). Ideally I would shrink them down to 4 bytes in EAX. Right now I'm doing 4 VPTEST operations and several conditional jumps to form the output, which slows things down significantly.

Question: how can I extract and use the comparison result efficiently with AVX2?

DbPro81

1 Answer


You're probably looking for vmovmskpd eax, ymm2 to get a 4-bit bitmap, one bit per qword element taken from its sign bit. You can then analyze it with integer instructions: test eax, eax to check whether any element was true, cmp al, 0xf to check whether all elements were true, or even use it as an index into a jump table like jmp [table + rax*8] if you need finer detail about which elements were true.
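As a rough sketch of the same idea in C intrinsics rather than assembly: `_mm256_cmp_pd` with predicate `_CMP_LT_OS` (value 1, as in the question) followed by `_mm256_movemask_pd`, which corresponds to vmovmskpd. The function name `below_threshold_mask` is hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper, not from the original post.
 * Returns a 4-bit mask: bit i is set when values[i] < thresholds[i].
 * The target attribute (GCC/Clang) enables AVX2 for this function only,
 * so the file can be built without -mavx2. */
__attribute__((target("avx2")))
int below_threshold_mask(const double *values, const double *thresholds)
{
    __m256d v  = _mm256_loadu_pd(values);        /* 4 doubles, unaligned load */
    __m256d t  = _mm256_loadu_pd(thresholds);
    __m256d lt = _mm256_cmp_pd(v, t, _CMP_LT_OS); /* each qword: 0 or all ones */
    return _mm256_movemask_pd(lt);               /* sign bit of each qword -> bits 0..3 */
}
```

For values {1, 5, 2, 9} against a threshold of 3 in every element, elements 0 and 2 compare true, so the result is 0x5.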

You could of course use vpmovmskb eax, ymm2 if you actually want 8 identical bits from each vector element, one from each byte, which gives a 32-bit mask.
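For comparison, a minimal sketch of the byte-granularity variant in C intrinsics, where `_mm256_movemask_epi8` corresponds to vpmovmskb on a ymm register. The function name `below_threshold_bytemask` is hypothetical, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>

/* Hypothetical helper, not from the original post.
 * Returns a 32-bit mask with one bit per byte of the compare result,
 * so each qword element contributes 8 identical bits. */
__attribute__((target("avx2")))
unsigned below_threshold_bytemask(const double *values, const double *thresholds)
{
    __m256d v  = _mm256_loadu_pd(values);
    __m256d t  = _mm256_loadu_pd(thresholds);
    __m256d lt = _mm256_cmp_pd(v, t, _CMP_LT_OS);   /* values < thresholds */
    /* Reinterpret as integer bytes (no instruction needed) and take
     * the top bit of each of the 32 bytes. */
    return (unsigned)_mm256_movemask_epi8(_mm256_castpd_si256(lt));
}
```

With elements 0 and 2 true, bits 0..7 and 16..23 are set, giving 0x00FF00FF.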

If you didn't already know about movmskps/pd and so on, I'd suggest Agner Fog's optimization guide, which has a chapter about SIMD: https://agner.org/optimize/ . Vector -> bitmap is one of x86's best features.

Peter Cordes
  • While this is certainly correct, often the thing the SIMD programmer should really be asking is why the data needs to be moved back to the condition registers / scalar domain in the first place. If possible, it may be preferable to use the vector compare results as-is in a branch-free vector algorithm downstream and skip the branchiness altogether. However, this would mean asking the age-old question "What are you really trying to do?", which 3 years after the fact seems unlikely to bear fruit. – Ian Ollmann Aug 31 '23 at 07:00
  • @IanOllmann: Yeah, looking at this again, it's not clear what they're really trying to do. Maybe they can just pack 8 vectors of qword compare results down to one vector of bytes using `vpacksswb`, which preserves the high bit of each word (and keeps them as 0 / -1). (Plus an eventual byte-shuffle or something to rearrange, since in-lane pack instructions work on the 128-bit halves separately.) – Peter Cordes Aug 31 '23 at 07:10
  • Or at least a scalar lookup table of data based on the 4-bit mask, such as a shuffle-control vector for `vpermpd` or something, if left-packing the original elements. (See [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) - 4 elements is few enough for a LUT of shuffle vectors to be good, unlike with 8 elements in a `__m256`.) IDK why I only mentioned indexing a jump table when I wrote this answer! – Peter Cordes Aug 31 '23 at 07:12
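One possible reading of the "scalar lookup table of data based on the 4-bit mask" idea, applied to the question's goal of writing 0 or 255 bytes: a 16-entry table that expands the mask into four output bytes without branching. This is only a sketch; the table `mask_to_bytes` and the function `threshold_to_bytes` are hypothetical names, and the `target` attribute assumes GCC or Clang.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical table: bit i of the index set -> byte i of the entry is 0xFF.
 * x86 is little-endian, so byte 0 is the low byte of each entry. */
static const uint32_t mask_to_bytes[16] = {
    0x00000000u, 0x000000FFu, 0x0000FF00u, 0x0000FFFFu,
    0x00FF0000u, 0x00FF00FFu, 0x00FFFF00u, 0x00FFFFFFu,
    0xFF000000u, 0xFF0000FFu, 0xFF00FF00u, 0xFF00FFFFu,
    0xFFFF0000u, 0xFFFF00FFu, 0xFFFFFF00u, 0xFFFFFFFFu,
};

/* Writes 4 bytes to out: 255 where values[i] < thresholds[i], else 0. */
__attribute__((target("avx2")))
void threshold_to_bytes(const double *values, const double *thresholds,
                        uint8_t *out)
{
    __m256d v  = _mm256_loadu_pd(values);
    __m256d t  = _mm256_loadu_pd(thresholds);
    __m256d lt = _mm256_cmp_pd(v, t, _CMP_LT_OS);
    int mask   = _mm256_movemask_pd(lt);      /* 0..15 */
    uint32_t bytes = mask_to_bytes[mask];     /* table lookup, no branches */
    memcpy(out, &bytes, sizeof bytes);        /* typically compiles to one 32-bit store */
}
```

When many vectors are processed in a loop, the pack-based approach from the previous comment may well be faster, since it stays in vector registers throughout.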