AVX512 introduces optional zero-masking and merge-masking for almost all instructions. Before that, the way to do a conditional add was to mask one operand (with vandps, or vandnps for the inverse condition) before the add, instead of using vblendvps on the result. This is why packed-compare instructions/intrinsics produce all-zero or all-one elements.
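0.0 is the additive identity element, so adding it is a no-op. (Except for signed zeros under IEEE semantics: in the default round-to-nearest mode, -0.0 + +0.0 gives +0.0, so strictly speaking only -0.0 is an identity for every input. That rarely matters in practice.)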
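To pin down the signed-zero caveat, here is a small scalar sketch. The helper names are made up for illustration, and it assumes the default rounding mode and no -ffast-math (which lets the compiler ignore signed zeros).

```c
#include <math.h>

/* Hypothetical helpers, for illustration only. */

/* True only for -0.0f: compares equal to zero but has the sign bit set. */
static int is_negative_zero(float x)
{
    return x == 0.0f && signbit(x) != 0;
}

/* Adding +0.0f: leaves every value unchanged except -0.0f, which becomes +0.0f. */
static float add_pos_zero(float x)
{
    return x + 0.0f;
}

/* Adding -0.0f: leaves every value unchanged, including both zeros. */
static float add_neg_zero(float x)
{
    return x + (-0.0f);
}
```

A compare mask ANDed with a constant gives all-zero bits (+0.0) in the unselected elements, so the masked add turns a -0.0 input into +0.0 there. Everything else passes through bit-for-bit.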
Masking a constant input instead of blending the result avoids making the critical path longer, for something like conditionally adding 1.0.
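As a rough sketch of the masked-constant approach with intrinsics: the function name and the "greater than threshold" condition are assumptions picked for the example. It uses 128-bit SSE intrinsics so it builds without extra flags on x86-64; compiled with AVX enabled, the same source should assemble to vcmpps / vandps / vaddps, and the _mm256_ versions work the same way.

```c
#include <immintrin.h>

/* Hypothetical example: per element, x + (x > threshold ? 1.0f : 0.0f). */
static __m128 add_one_where_greater(__m128 x, __m128 threshold)
{
    /* Each element of mask is all-ones where x > threshold, else all-zero. */
    __m128 mask = _mm_cmpgt_ps(x, threshold);

    /* AND the mask with the constant: 1.0f where selected, +0.0f elsewhere.
       This only depends on the compare, not on any earlier add result. */
    __m128 increment = _mm_and_ps(mask, _mm_set1_ps(1.0f));

    /* The only operation x itself goes through is this one add. */
    return _mm_add_ps(x, increment);
}
```

If x is a loop-carried accumulator and the condition comes from other data, the compare and AND sit off the dependency chain, and the chain through x is just the add latency per iteration.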
Conditional multiply is more cumbersome because 0.0 is not the multiplicative identity. You need to multiply by 1.0 to keep a value unchanged, and you can't easily produce that with an AND or ANDN with a compare result. You can blendv an input, or you can do the multiply and blendv the output.
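A minimal sketch of the "blendv an input" option, assuming the condition is "element is negative" and using a made-up function name. _mm_blendv_ps needs SSE4.1, so the function carries a GCC/Clang target attribute (an assumption about the compiler; with -msse4.1 or -mavx on the command line you can drop it).

```c
#include <immintrin.h>

/* Hypothetical example: per element, x * (x < 0 ? factor : 1.0f). */
__attribute__((target("sse4.1")))
static __m128 scale_where_negative(__m128 x, __m128 factor)
{
    /* All-ones where x < 0, else all-zero. */
    __m128 mask = _mm_cmplt_ps(x, _mm_setzero_ps());

    /* blendv picks the second operand where the mask's sign bit is set,
       so the multiplier is factor for selected elements and 1.0f otherwise. */
    __m128 multiplier = _mm_blendv_ps(_mm_set1_ps(1.0f), factor, mask);

    return _mm_mul_ps(x, multiplier);
}
```

Blending the multiplier rather than the product keeps the blend off the path through x whenever the condition and factor are ready early.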
The alternative to blendv is at least 3 boolean operations, like AND/ANDN/OR, but that's usually not worth it. Note, though, that Haswell runs vblendvps and vpblendvb as 2 uops for port 5, so they're a potential bottleneck compared to integer booleans that can run on any vector ALU port. Skylake runs them as 2 uops for any port. It could make sense to do something to avoid having a blendv on the critical path, though.
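For reference, the AND/ANDN/OR replacement for a variable blend could look something like this. The function name is hypothetical, and it assumes the mask comes from a packed compare, so every element is all-ones or all-zero.

```c
#include <immintrin.h>

/* Hypothetical helper: per element, mask ? a : b, using three booleans.
   Unlike blendv, which only looks at the sign bit of each mask element,
   this needs every bit of a selected element's mask to be set. */
static __m128 select_bitwise(__m128 mask, __m128 a, __m128 b)
{
    __m128 from_a = _mm_and_ps(mask, a);     /* mask & a            */
    __m128 from_b = _mm_andnot_ps(mask, b);  /* (~mask) & b         */
    return _mm_or_ps(from_a, from_b);        /* combine the halves  */
}
```

The AND and ANDN are independent of each other, so the latency from mask to result is two boolean ops even though it's three uops total.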
Masking an input operand or blending the result is generally how you do branchless SIMD conditionals.
BLENDV is usually at least 2 uops, so it's slower than an AND.
Immediate blends are much more efficient, but you can't use them with a mask computed at run time, because the imm8 blend control has to be a compile-time constant embedded into the instruction's machine code. That's what immediate means in an assembly-language context.
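To make the contrast concrete, this is roughly what an immediate blend looks like from C. The function name and the choice of control value are just for illustration, and the target attribute again assumes GCC or Clang (blendps is SSE4.1).

```c
#include <immintrin.h>

/* Hypothetical example: take elements 0 and 2 from b, elements 1 and 3 from a.
   The control 0x5 (binary 0101) must be a compile-time constant, because it
   is encoded directly into the blendps instruction. Passing a run-time
   variable there is rejected by the compiler. */
__attribute__((target("sse4.1")))
static __m128 take_even_elements_from_b(__m128 a, __m128 b)
{
    return _mm_blend_ps(a, b, 0x5);
}
```

So this only helps when the selection pattern is fixed, not when it comes from a compare on the data.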