How to pack __m128i elements using masks?

Question

I have the following:

int j0 = 190;
int j1 = 191;
int j2 = 192;
int j3 = 193;
__m128i jv = _mm_set_epi32(j3, j2, j1, j0);
__m256d rij = _mm256_set_pd(2.8, 1.8, 2.1, 3.4);
__m256d sij = _mm256_set1_pd(2.5);
__m256d mask = sij - rij;

Using the information in mask, I would like to pack the integers in jv whose corresponding elements satisfy rij < sij. In the above example, the desired result is

[X, X, 192, 191],

where X means we do not care what the value is.

How do I get this result using AVX2 intrinsics?

Thanks.

Peter Cordes · Answer 1 · 2022-06-12T02:55:05.140

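Probably just a 16-entry lookup table of __m128i shuffle-control vectors for _mm_shuffle_epi8, indexed by the 4-bit result of _mm256_movemask_pd. Normally you'd want to use a compare (e.g. _mm256_cmp_pd) instead of a sij - rij subtract. If you also need the subtract result, though, you could use its sign bit instead of comparing separately into a vector of masks, since movemask only looks at sign bits. Note that the sign bit of sij - rij is set where rij > sij, so to select rij < sij you'd subtract the other way around (rij - sij).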
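A minimal sketch of that lookup-table idea, under some assumptions: the names pack_lut, init_pack_lut and pack_less_than are made up for illustration, the table is built at startup rather than written out as constants, and the don't-care lanes are zeroed. The target attribute and __builtin_popcount are GCC/Clang extensions, used here so the sketch compiles without -mavx2; drop the attribute if you build with -mavx2.

```c
#include <assert.h>
#include <immintrin.h>
#include <stdint.h>

/* Hypothetical table: pack_lut[m] is the pshufb control that moves the
   32-bit elements selected by the 4-bit mask m down to the low end. */
static uint8_t pack_lut[16][16];

void init_pack_lut(void)
{
    for (int m = 0; m < 16; m++) {
        int out = 0;                          /* next free output element */
        for (int src = 0; src < 4; src++) {
            if (m & (1 << src)) {
                for (int b = 0; b < 4; b++)   /* copy the 4 bytes of element src */
                    pack_lut[m][4 * out + b] = (uint8_t)(4 * src + b);
                out++;
            }
        }
        for (int i = 4 * out; i < 16; i++)
            pack_lut[m][i] = 0x80;            /* high bit set: pshufb writes 0 */
    }
}

/* Packs every idx[i] with r[i] < s into out[0..count-1] and returns count.
   The remaining elements of out are zero (the "don't care" lanes). */
__attribute__((target("avx2")))
int pack_less_than(const int32_t idx[4], const double r[4], double s,
                   int32_t out[4])
{
    __m128i jv  = _mm_loadu_si128((const __m128i *)idx);
    __m256d rij = _mm256_loadu_pd(r);
    __m256d sij = _mm256_set1_pd(s);

    __m256d lt = _mm256_cmp_pd(rij, sij, _CMP_LT_OQ);  /* all-ones where r < s */
    int m = _mm256_movemask_pd(lt);                    /* one bit per double */
    assert(m >= 0 && m < 16);                          /* stays inside the table */

    __m128i ctrl = _mm_loadu_si128((const __m128i *)pack_lut[m]);
    _mm_storeu_si128((__m128i *)out, _mm_shuffle_epi8(jv, ctrl));
    return __builtin_popcount(m);
}
```

Call init_pack_lut() once before the first pack_less_than(). With the question's data in memory order, idx = {190, 191, 192, 193}, r = {3.4, 2.1, 1.8, 2.8} and s = 2.5, the mask is 0b0110, so 191 and 192 land in the two low elements. The returned count tells you how many packed elements are valid.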

See AVX2 what is the most efficient way to pack left based on a mask?

You only need to get tricky when you have more than 4 elements. You could save LUT space by loading with vpmovzxbd for vpermilps, but that would take 2 shuffles per compare.
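The space-saving variant might look roughly like the sketch below. It assumes a 64-byte table of element indices (4 bytes per mask value) that is widened with vpmovzxbd (_mm_cvtepu8_epi32) and then used as the control for vpermilps (_mm_permutevar_ps). The names small_lut, init_small_lut and pack_less_than_small are hypothetical, and the target attribute and __builtin_popcount are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical compact table: 4 source-element indices per mask value,
   64 bytes in total instead of 256. */
static uint8_t small_lut[16][4];

void init_small_lut(void)
{
    for (int m = 0; m < 16; m++) {
        int out = 0;
        for (int src = 0; src < 4; src++)
            if (m & (1 << src))
                small_lut[m][out++] = (uint8_t)src;
        while (out < 4)
            small_lut[m][out++] = 0;   /* don't-care lanes just copy element 0 */
    }
}

/* Packs every idx[i] with r[i] < s into out[0..count-1] and returns count. */
__attribute__((target("avx2")))
int pack_less_than_small(const int32_t idx[4], const double r[4], double s,
                         int32_t out[4])
{
    __m128i jv = _mm_loadu_si128((const __m128i *)idx);
    __m256d lt = _mm256_cmp_pd(_mm256_loadu_pd(r), _mm256_set1_pd(s),
                               _CMP_LT_OQ);
    int m = _mm256_movemask_pd(lt);

    int32_t bytes;                                   /* 4 packed index bytes */
    memcpy(&bytes, small_lut[m], sizeof bytes);
    __m128i ctrl = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(bytes)); /* shuffle 1 */
    __m128 packed = _mm_permutevar_ps(_mm_castsi128_ps(jv), ctrl); /* shuffle 2 */

    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(packed));
    return __builtin_popcount(m);
}
```

This trades table size for an extra shuffle: the zero-extension and the variable permute are both shuffle operations, whereas the 256-byte table needs only the single pshufb.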

Of course, if you had AVX-512 you could compare the doubles into a mask register and use the resulting __mmask8 with vpcompressd to pack 32-bit elements (the 128-bit and 256-bit forms need AVX-512VL). Before AVX-512 you don't have this primitive operation in hardware, and it's expensive to emulate.
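For comparison, a sketch of the AVX-512 route, assuming AVX-512F plus AVX-512VL are available. The function name pack_less_than_avx512 is made up for illustration, and the target attribute and __builtin_popcount are GCC/Clang extensions.

```c
#include <immintrin.h>
#include <stdint.h>

/* Packs every idx[i] with r[i] < s into out[0..count-1] and returns count.
   The remaining elements of out are zeroed by the maskz form. */
__attribute__((target("avx512f,avx512vl")))
int pack_less_than_avx512(const int32_t idx[4], const double r[4], double s,
                          int32_t out[4])
{
    __m128i jv  = _mm_loadu_si128((const __m128i *)idx);
    __m256d rij = _mm256_loadu_pd(r);
    __m256d sij = _mm256_set1_pd(s);

    /* Compare straight into a mask register: bit i is set where r[i] < s. */
    __mmask8 k = _mm256_cmp_pd_mask(rij, sij, _CMP_LT_OQ);

    /* vpcompressd: move the selected 32-bit elements to the low end. */
    __m128i packed = _mm_maskz_compress_epi32(k, jv);

    _mm_storeu_si128((__m128i *)out, packed);
    return __builtin_popcount(k);
}
```

No table and no movemask are needed here, because the compare result is already a bitmask that the compress instruction consumes directly.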
