How to use AVX intrinsics to compare two vectors of packed double precision in C

Question

I would like to use _mm512_mask_cmple_pd_mask to compare two packed double precision vectors. My issue is that the result comes as __mmask8 type...

So I guess my question is: how do I convert such a mask into a packed integer vector, so I can use the result of the comparison later on?

In my particular case, I need to know how many elements compare true, so I will need to do some sort of reduction afterwards... but one thing at a time!

score 2 · Answer 1 · answered Sep 13 '22 at 21:35

You use __mmask8 with other AVX-512 intrinsics, like _mm512_maskz_add_pd (__mmask8 k, __m512d a, __m512d b); to do a zero-masking add, producing 0.0 where the mask was zero, and the normal result where the mask was one.

To count matches, _popcnt32(mask) works; __mmask8 can implicitly convert to/from integer types. (In fact it's just a typedef for uint8_t in existing implementations.)
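As a rough sketch of both points above (using the mask with a zero-masking add, and counting its set bits), something like the following should work. The function name masked_add_and_count is made up for illustration, __builtin_popcount is a GCC/Clang builtin used here in place of _popcnt32 so no extra target flag is needed, and the target attribute is a GCC/Clang extension (alternatively, compile with -mavx512f).

```c
#include <immintrin.h>

/* Hypothetical helper, not from the question: compares 8 doubles x[i] <= v[i],
   stores x[i] + v[i] where the compare is true and 0.0 elsewhere,
   and returns how many lanes compared true. */
__attribute__((target("avx512f")))
int masked_add_and_count(const double *x, const double *v, double *out)
{
    __m512d a = _mm512_loadu_pd(x);             /* unaligned loads: no alignment assumed */
    __m512d b = _mm512_loadu_pd(v);

    __mmask8 k = _mm512_cmple_pd_mask(a, b);    /* bit i set where x[i] <= v[i] */

    __m512d sum = _mm512_maskz_add_pd(k, a, b); /* zero-masking: 0.0 where bit is clear */
    _mm512_storeu_pd(out, sum);

    return __builtin_popcount(k);               /* __mmask8 converts to an integer */
}
```

This only runs on a CPU with AVX-512F; calling it elsewhere will fault, so check for support first if that is a concern.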

But are you sure you want AVX-512 masking? You only tagged your question [avx], and mask registers like __mmask8 are a new feature in AVX-512. Without AVX-512, you'd use AVX1 _mm256_cmp_pd(a,b, _CMP_EQ_OQ) to get a vector, and AVX1 _mm256_movemask_pd on that to get an int bitmap.
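A minimal sketch of that AVX1 route, assuming the <= comparison from the question (so the predicate is _CMP_LE_OQ rather than _CMP_EQ_OQ). The name count_le_avx is hypothetical, and __builtin_popcount is a GCC/Clang builtin standing in for _popcnt32.

```c
#include <immintrin.h>

/* Hypothetical helper: counts how many of 4 doubles satisfy x[i] <= v[i]. */
__attribute__((target("avx")))
int count_le_avx(const double *x, const double *v)
{
    __m256d a = _mm256_loadu_pd(x);
    __m256d b = _mm256_loadu_pd(v);

    /* Each 64-bit element becomes all-ones where x <= v, all-zeros otherwise. */
    __m256d cmp = _mm256_cmp_pd(a, b, _CMP_LE_OQ);

    /* Collect the sign bit of each element into the low 4 bits of an int. */
    int bits = _mm256_movemask_pd(cmp);

    return __builtin_popcount((unsigned)bits);
}
```

The int bitmap plays the same role as the __mmask8 in the AVX-512 version, just produced in two steps.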

Or if you're doing this over multiple vectors, use integer subtraction to accumulate a count, like AVX2 counts = _mm256_sub_epi64(counts, cmp_result);, then hsum that at the end. (A compare result vector has elements with all-0 or all-1 bits, i.e. integer 0 or -1.) See:

Fastest way to do horizontal SSE vector sum (or other reduction)
Fastest method to calculate sum of all packed 32-bit integers using AVX512 or AVX2

The second one is pretty trivial to modify for 64-bit elements: still start by reducing to __m128i with _mm_add_epi64(_mm256_extracti128_si256(counts,1), _mm256_castsi256_si128(counts)).
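Put together, the accumulate-then-hsum idea might look roughly like this. It is only a sketch: count_le_avx2 is a made-up name, the <= predicate is assumed from the question, and the scalar tail loop for leftover elements is an addition not discussed above. Note the explicit _mm256_castpd_si256, since the compare result is a __m256d.

```c
#include <immintrin.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: counts elements with x[i] <= v[i] over n doubles. */
__attribute__((target("avx2")))
int64_t count_le_avx2(const double *x, const double *v, size_t n)
{
    __m256i counts = _mm256_setzero_si256();
    size_t i = 0;

    for (; i + 4 <= n; i += 4) {
        __m256d cmp = _mm256_cmp_pd(_mm256_loadu_pd(x + i),
                                    _mm256_loadu_pd(v + i), _CMP_LE_OQ);
        /* True lanes are integer -1, so subtracting adds 1 to that lane's count. */
        counts = _mm256_sub_epi64(counts, _mm256_castpd_si256(cmp));
    }

    /* Horizontal sum: 256 -> 128 bits, then the two remaining 64-bit halves. */
    __m128i sum2 = _mm_add_epi64(_mm256_extracti128_si256(counts, 1),
                                 _mm256_castsi256_si128(counts));
    __m128i sum1 = _mm_add_epi64(sum2, _mm_unpackhi_epi64(sum2, sum2));
    int64_t total = _mm_cvtsi128_si64(sum1);

    /* Scalar tail for the last n % 4 elements. */
    for (; i < n; i++)
        total += (x[i] <= v[i]);

    return total;
}
```

With 64-bit counters per lane there is no realistic overflow risk, and the horizontal sum runs once, outside the loop.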

score 2 · Answer 2 · edited Sep 14 '22 at 08:34

Thanks to Peter for pointing me in the right direction!
I was not aware of _popcnt32.

Also, I was trying to use the wrong intrinsic; the one that does the job is _mm512_cmple_pd_mask.

For reference, I post below the test code that accomplishes what I need.

#include <stdio.h>
#include <immintrin.h>
#include <stdint.h>

#define p_size 8

int main(){
    double  x[p_size] __attribute__((aligned(64)));
    double  v[p_size] __attribute__((aligned(64)));

    // Put some values in for testing
    for (int i = 0; i < p_size; i++){
        x[i] = 1.0*i;
        v[i] = 2.0*i-2.0;
    }

    // Get the correct result with a for loop:
    int jj = 0;
    for (int i = 0; i < p_size; i++){
        if (x[i] <= v[i]){jj++;}
    }

    // Now use AVX-512 to get the same information
    __m512d  xpd  = _mm512_load_pd(&x[0]);
    __m512d  vpd  = _mm512_load_pd(&v[0]);
    __mmask8 mask = _mm512_cmple_pd_mask (xpd, vpd);
    int ii = _popcnt32(mask);

    // Print results and check:
    printf("For Loop = %d, SIMD = %d\n",jj, ii);

    return 0;
}

`__attribute__((aligned(64)))` is mostly obsoleted by ISO C11/C++11 `alignas(64) double x[size];` (In C, `#include <stdalign.h>` to get a `#define alignas _Alignas`). But yeah, that's how it's done. And yeah, `_mm512_mask_cmple_pd_mask` is only useful if you want to do a masked compare into another mask, to effectively AND the compare result with another mask. You could use it with a `-1` constant and hope the compiler realizes it can just use the unmasked version, but it's clearer to use the no-masking intrinsic. - Peter Cordes, Sep 14 '22 at 07:34
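To illustrate the case where the masked compare is actually useful, here is a hedged sketch that counts elements lying in a range by feeding the first compare's mask into the second. The function count_in_range and its lo/hi parameters are invented for this example, and __builtin_popcount is a GCC/Clang builtin.

```c
#include <immintrin.h>

/* Hypothetical helper: counts how many of 8 doubles satisfy lo[i] <= x[i] <= hi[i]. */
__attribute__((target("avx512f")))
int count_in_range(const double *x, const double *lo, const double *hi)
{
    __m512d a = _mm512_loadu_pd(x);

    /* First compare, unmasked: bit i set where lo[i] <= x[i]. */
    __mmask8 ge_lo = _mm512_cmple_pd_mask(_mm512_loadu_pd(lo), a);

    /* Masked compare: result bit i is ge_lo[i] AND (x[i] <= hi[i]). */
    __mmask8 both = _mm512_mask_cmple_pd_mask(ge_lo, a, _mm512_loadu_pd(hi));

    return __builtin_popcount(both);
}
```

The same result could be had with two unmasked compares and a bitwise AND of the masks; the masked form just folds the AND into the second compare.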
