
I understand how to do general arithmetic operations in AVX2. However, there are conditional operations in scalar code that I would like to translate to AVX2. How should I do that? For example, I would like to vectorize:

double arr[4] = {1.0,2.0,3.0,4.0};
double condition = 3.0;
for (int i = 0; i < 4; i++) {
    if (arr[i] < condition) {
        arr[i] *= 1.75;
    }
    else {
        arr[i] *= 6.5;
    }
}
for (auto i : arr) {
    std::cout << i << '\t';
}

Expected output:

1.75 3.5 19.5 26

How can I perform conditional operations like above in AVX2?

Vladislav Kogan

1 Answer


Use branchless AVX2 conditional operations: calculate both possible outputs on whole vectors, then use the compare result as a per-element mask to keep whichever result satisfies your condition. For your case:

#include <immintrin.h>
#include <iostream>

double arr[4] = { 1.0, 2.0, 3.0, 4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vMultiplier2 = _mm256_set1_pd(6.5);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1);  // if-branch
__m256d vSecondResult = _mm256_mul_pd(vArr, vMultiplier2); // else-branch
__m256d vCondition = _mm256_set1_pd(condition);

vCondition = _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); // a < b ordered (non-signalling)
// Use the mask to choose between vFirstResult and vSecondResult for each element
vFirstResult = _mm256_blendv_pd(vSecondResult, vFirstResult, vCondition);

double res[4];
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
    std::cout << i << '\t';
}

A possible alternative to BLENDV is a combination of AND, ANDNOT and OR. However, BLENDV is better in both simplicity and performance. Use BLENDV as long as you have at least SSE4.1 and don't have AVX512 yet.
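For reference, here is a minimal sketch of that bitwise variant, reusing vFirstResult, vSecondResult and vCondition from the code above; it produces the same result as the blendv, just with three instructions instead of one:

__m256d vIfPart   = _mm256_and_pd(vCondition, vFirstResult);     // keep if-branch results where the mask is all-ones
__m256d vElsePart = _mm256_andnot_pd(vCondition, vSecondResult); // keep else-branch results where the mask is all-zeros
__m256d vBlended  = _mm256_or_pd(vIfPart, vElsePart);            // combine into the final result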

For information about what _CMP_LT_OQ means and which other predicates are available, see Dave Dopson's table. You can do whatever comparison you want by changing this predicate accordingly.
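For example (a hypothetical variation, not something the question asks for), selecting elements where arr[i] >= condition instead just means swapping the predicate:

__m256d vGeMask = _mm256_cmp_pd(vArr, _mm256_set1_pd(condition), _CMP_GE_OQ); // a >= b ordered (non-signalling)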

There are detailed notes by Peter Cordes about conditional operations in AVX2 and AVX512. There are more examples of conditional vectorization (with SSE and AVX512 examples) in Agner Fog's "Optimizing C++", chapter 12.4, pages 121-124.

Maybe you don't want to do any computation in the else-branch, or you explicitly want to zero those elements, so that your expected output looks like

1.75    3.5     0       0

In that case you can use a slightly faster instruction sequence, since you don't have to think about the else-branch. There are at least two ways to achieve a speedup:

  1. Remove the second multiplication but keep the blendv. Instead of vSecondResult, just use a zeroed vector (it can be a global constant).
  2. Remove both the second multiplication and the blendv, replacing the blendv with a bitwise AND against the compare mask. Elements that fail the condition become zero, because the mask is all-zero bits there.

The second way is better. For example, according to the uops.info tables, VBLENDVPD on the Skylake microarchitecture takes 2 uops, has 2 cycles of latency and a throughput of one per clock. Meanwhile, VANDPD is a single uop with 1 cycle of latency and can execute 3 times per clock.

The worse way, just blending with zero:

double arr[4] = { 1.0,2.0,3.0,4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1); //if-branch
__m256d vZeroes = _mm256_setzero_pd();
__m256d vCondition = _mm256_set1_pd(condition);

vCondition = _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); //a < b ordered (non-signalling)
// Conditionally select vFirstResult where the condition is satisfied, zeroes otherwise
vFirstResult = _mm256_blendv_pd(vZeroes, vFirstResult, vCondition);

double res[4];
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
    std::cout << i << '\t';
}

The better way: bitwise AND with the compare result is a cheaper way to conditionally zero.

double arr[4] = { 1.0,2.0,3.0,4.0 };
double condition = 3.0;
__m256d vArr = _mm256_loadu_pd(&arr[0]);
__m256d vMultiplier1 = _mm256_set1_pd(1.75);
__m256d vFirstResult = _mm256_mul_pd(vArr, vMultiplier1); //if-branch
__m256d vCondition = _mm256_set1_pd(condition);

vCondition = _mm256_cmp_pd(vArr, vCondition, _CMP_LT_OQ); //a < b ordered (non-signalling)
// Where the condition is not satisfied, the bitwise AND zeroes the result
vFirstResult = _mm256_and_pd(vFirstResult, vCondition);

double res[4] = {0.0,0.0,0.0,0.0};
_mm256_storeu_pd(&res[0], vFirstResult);
for (auto i : res) {
    std::cout << i << '\t';
}
This takes advantage of what a compare-result vector really is, and of the fact that the bit-pattern for IEEE 0.0 is all zero bits.
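The same all-zero-bits property is useful beyond multiplication. As Peter Cordes points out in the comments below, you can conditionally add by masking a set1 vector, so each element gets either the constant or 0.0 added. A minimal sketch, reusing vArr and vCondition from above (the constant 10.0 and the variable names are just for illustration):

__m256d vAddend = _mm256_set1_pd(10.0);                   // value to add in the if-branch
__m256d vMaskedAdd = _mm256_and_pd(vCondition, vAddend);  // 10.0 where the mask is set, 0.0 elsewhere
__m256d vCondSum = _mm256_add_pd(vArr, vMaskedAdd);       // adds 10.0 or 0.0 per element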

Vladislav Kogan
    Often you'd blend the multiplier and only do one `_mm256_mul_pd`. That saves an instruction. Your way has lower critical-path latency since the compare and multiplies can run in parallel, but typical SIMD use-cases are more sensitive to throughput; unrolling can hide latency if necessary. – Peter Cordes Nov 16 '22 at 04:40
    I notice you edited out an alternate version after comments under the question complained about lack of focus. I think showing `and` or `andnot` to conditionally zero something is a reasonable part of an *answer*. The current question is good and focused, but there's room in an answer to cover interesting variations. The fact that the bit-pattern for `0.0` is all-zero bits is very useful, especially with addition where you can conditionally add by masking a `set1` vector, so you add zero or something else. – Peter Cordes Nov 16 '22 at 05:54
  • Updated based on your comments. I hope it will be useful to link to if someone will ask about AVX2, not SSE. – Vladislav Kogan Nov 18 '22 at 02:22
    You never want to use `_mm256_blendv_pd` when one of the inputs is `_mm256_setzero_pd` when the selector vector is a compare result. Always just use bitwise AND or ANDN in that case to conditionally zero. Single uop on all CPUs (except Zen1 etc), vs. some taking 2 uops for blendv (https://uops.info/). A `blendv` with zeros can make sense to use just the high bit of another vector, e.g. using the sign bits directly instead of comparing against zero first. – Peter Cordes Nov 18 '22 at 02:35
    (Oh, your answer is making that point, I was just skimming and saw the first code block being worse, which is usually not what you want.) – Peter Cordes Nov 18 '22 at 02:43
    For style, normally you'd use separate variables for the compare input vs. the mask result. Conceptually, they aren't *really* even the same type, it's more like an `__m256i` that happens to be in an `__m256d`. Agner Fog's VCL uses `Vec4d` (4 doubles) vs. `Vec4fb` (boolean result that goes with vectors of 4 doubles). You only ever want to use a compare result with bitwise ops and blend controls, and the compare threshold is a constant you'd often want to reuse in a loop. Also, `_mask` isn't a meaningful name for `set1_pd(3.0)`. – Peter Cordes Nov 18 '22 at 02:51
    Leading underscore names always seemed like bad style for variable names. Intrinsics already have leading underscores all over the place, so using it for your var names too makes more of a mess. And in some cases, leading-underscore names are reserved for the implementation. (Not at function scope with only one leading underscore, but it's pretty close to the rule for being reserved). If I have a scalar `a`, I might use `__m256d va = _mm256_set1_pd(a);`, where `v` means "vector". – Peter Cordes Nov 18 '22 at 02:54
  • Updated based on your remarks. – Vladislav Kogan Nov 18 '22 at 19:52