
Can you give the list of conditional instructions available in AVX2? So far I've found the following:

  • _mm256_blendv_* for selection from a and b based on mask c

Is there something like a conditional multiply and a conditional add, etc.?

Also, for instructions taking an imm8 (like _mm256_blend_*), could you explain how to get that imm8 after a vector comparison?

Serge Rogatch
  • [All AVX2 intrinsics are here](https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=AVX2). Is that what you're asking for in the first part of the question? We don't really do "list"-style questions here. – Cody Gray - on strike Aug 23 '17 at 18:18
  • @CodyGray, I'm using those pages actively, but couldn't find anything more except what I've listed. Perhaps I've missed something. I expect the list to be short, maybe 5 items, maybe no more items. So I think the question isn't too broad. And there is no need to describe what the intrinsics are doing (that can be read from the documentation, once I know how to search for them). Just how to use them in principle like the versions taking `imm8`. – Serge Rogatch Aug 23 '17 at 18:23

2 Answers

3

The Intel Intrinsics Guide suggests gather, load, and store operations that take a mask. The immediate imm8 in blend_epi16 is not programmable unless self-modifying code or a jump table is considered an option. It's still possible to derive a per-element mask using pext from BMI2 to compact the odd-positioned bits of the result of movemask -- movemask gives 32 independent mask bits in AVX2 (one per byte), but each bit of blend_epi16's imm8 controls four bytes: one 16-bit element in each 128-bit lane.

Aki Suihkonen
2

AVX512 introduces optional zero-masking and merge-masking for almost all instructions.

Before that, to do a conditional add, mask one operand (with vandps or vandnps for the inverse) before the add (instead of vblendvps on the result). This is why packed-compare instructions/intrinsics produce all-zero or all-one elements.

0.0 is the additive identity element, so adding it is a no-op. (Almost: under IEEE semantics, -0.0 + +0.0 gives +0.0 in the default rounding mode, so a -0.0 element would lose its sign bit.)

Masking a constant input instead of blending the result avoids making the critical path longer, for something like conditionally adding 1.0.


Conditional multiply is more cumbersome because 0.0 is not the multiplicative identity. You need to multiply by 1.0 to keep a value unchanged, and you can't easily produce that with an AND or ANDN with a compare result. You can blendv an input, or you can do the multiply and blendv the output.

The alternative to blendv is at least 3 booleans, like AND/ANDN/OR, but that's usually not worth it. Note that Haswell runs vblendvps and vpblendvb as 2 uops for port 5, so it's a potential bottleneck compared to using integer booleans that can run on any port. Skylake runs vblendvps as 2 uops that can go to any port. It can still make sense to keep a blendv off the critical path, though.

Masking an input operand or blending the result is generally how you do branchless SIMD conditionals.

BLENDV is usually at least 2 uops, so it's slower than an AND.

Immediate blends are much more efficient, but you can't use them, because the imm8 blend control has to be a compile-time constant embedded into the instruction's machine code. That's what immediate means in an assembly-language context.

Peter Cordes