I have the following:

int j0 = 190;
int j1 = 191;
int j2 = 192;
int j3 = 193;
__m128i jv = _mm_set_epi32(j3, j2, j1, j0);
__m256d rij = _mm256_set_pd(2.8, 1.8, 2.1, 3.4);
__m256d sij = _mm256_set1_pd(2.5);
__m256d mask = sij - rij;

Using the information in mask, I would like to pack the integers from jv whose lanes satisfy rij < sij into the low elements of a vector. In the above example, the desired result is

[X, X, 192, 191],

where X means we do not care what the value is.

How do I get this result using AVX2 intrinsics?

Thanks.

1 Answer

Probably just a 16-entry lookup table of __m128i shuffle-control vectors for _mm_shuffle_epi8, indexed by the 4-bit result of _mm256_movemask_pd. Normally you'd produce the mask with a compare (_mm256_cmp_pd) instead of the sij - rij subtract. If you also need the subtract result, you could use its sign bit instead of comparing separately into a vector of masks, but note that the sign bit of sij - rij is set where rij > sij, so you'd want rij - sij to select rij < sij directly.
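A minimal sketch of that lookup-table approach, under some assumptions: the function name pack_left_lt, the table name pack_lut, and the pointer-based interface are hypothetical, and the per-function target attribute and __builtin_popcount are GCC/clang specific (with MSVC you would compile with /arch:AVX and use a different popcount). Unused output lanes are zeroed here by using 0x80 control bytes, although the question does not care what they hold.

```c
#include <immintrin.h>
#include <stdint.h>

/* D(k): the 4 byte indices that select 32-bit element k for pshufb.
   Z:    4 control bytes with the high bit set, which zero the lane. */
#define D(k) 4*(k), 4*(k)+1, 4*(k)+2, 4*(k)+3
#define Z 0x80, 0x80, 0x80, 0x80

/* One shuffle-control vector per 4-bit compare mask (hypothetical name). */
static const uint8_t pack_lut[16][16] __attribute__((aligned(16))) = {
    { Z,    Z,    Z,    Z    },  /* 0000 */
    { D(0), Z,    Z,    Z    },  /* 0001 */
    { D(1), Z,    Z,    Z    },  /* 0010 */
    { D(0), D(1), Z,    Z    },  /* 0011 */
    { D(2), Z,    Z,    Z    },  /* 0100 */
    { D(0), D(2), Z,    Z    },  /* 0101 */
    { D(1), D(2), Z,    Z    },  /* 0110 */
    { D(0), D(1), D(2), Z    },  /* 0111 */
    { D(3), Z,    Z,    Z    },  /* 1000 */
    { D(0), D(3), Z,    Z    },  /* 1001 */
    { D(1), D(3), Z,    Z    },  /* 1010 */
    { D(0), D(1), D(3), Z    },  /* 1011 */
    { D(2), D(3), Z,    Z    },  /* 1100 */
    { D(0), D(2), D(3), Z    },  /* 1101 */
    { D(1), D(2), D(3), Z    },  /* 1110 */
    { D(0), D(1), D(2), D(3) },  /* 1111 */
};
#undef D
#undef Z

/* Packs the j[i] for which rij[i] < sij into the low elements of out
   and returns how many were kept. Needs AVX (which implies SSSE3). */
__attribute__((target("avx")))
int pack_left_lt(const int32_t j[4], const double rij[4], double sij,
                 int32_t out[4])
{
    __m128i jv = _mm_loadu_si128((const __m128i *)j);
    __m256d r  = _mm256_loadu_pd(rij);
    __m256d s  = _mm256_set1_pd(sij);

    /* Compare directly instead of subtracting: all-ones where rij < sij. */
    __m256d lt = _mm256_cmp_pd(r, s, _CMP_LT_OQ);
    int m = _mm256_movemask_pd(lt);          /* 4-bit mask, 0..15 */

    __m128i ctrl   = _mm_load_si128((const __m128i *)pack_lut[m]);
    __m128i packed = _mm_shuffle_epi8(jv, ctrl);

    _mm_storeu_si128((__m128i *)out, packed);
    return __builtin_popcount((unsigned)m);
}
```

With the question's data in memory order, j = {190, 191, 192, 193} and rij = {3.4, 2.1, 1.8, 2.8}, the compare mask is 0b0110, so out starts with 191, 192 and the function returns 2.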

See "AVX2 what is the most efficient way to pack left based on a mask?" for the general technique.

You only need to get tricky when you have more than 4 elements. You could save LUT space by loading with vpmovzxbd for vpermilps, but that would take 2 shuffles per compare.
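The space-saving variant could look roughly like the sketch below: each table entry stores four 1-byte element indices (64 bytes total instead of 256), which are widened with vpmovzxbd (_mm_cvtepu8_epi32) and then fed to vpermilps (_mm_permutevar_ps). The names pack_idx_lut and pack_left_lt_small_lut are hypothetical, and the target attribute and __builtin_popcount again assume GCC or clang. Unused output lanes receive element 0, which is fine since their value does not matter.

```c
#include <immintrin.h>
#include <stdint.h>
#include <string.h>

/* Per compare mask: the source element index for each output lane
   (hypothetical table name). Unused lanes just point at element 0. */
static const uint8_t pack_idx_lut[16][4] = {
    {0,0,0,0}, {0,0,0,0}, {1,0,0,0}, {0,1,0,0},
    {2,0,0,0}, {0,2,0,0}, {1,2,0,0}, {0,1,2,0},
    {3,0,0,0}, {0,3,0,0}, {1,3,0,0}, {0,1,3,0},
    {2,3,0,0}, {0,2,3,0}, {1,2,3,0}, {0,1,2,3},
};

/* Same contract as a full-LUT version: keep j[i] where rij[i] < sij,
   packed to the low lanes of out; returns the number kept. */
__attribute__((target("avx")))
int pack_left_lt_small_lut(const int32_t j[4], const double rij[4],
                           double sij, int32_t out[4])
{
    __m128i jv = _mm_loadu_si128((const __m128i *)j);
    __m256d lt = _mm256_cmp_pd(_mm256_loadu_pd(rij),
                               _mm256_set1_pd(sij), _CMP_LT_OQ);
    int m = _mm256_movemask_pd(lt);

    /* Load the 4 index bytes, then zero-extend each one to 32 bits
       (first shuffle: vpmovzxbd). */
    int32_t idx_bytes;
    memcpy(&idx_bytes, pack_idx_lut[m], sizeof idx_bytes);
    __m128i ctrl = _mm_cvtepu8_epi32(_mm_cvtsi32_si128(idx_bytes));

    /* Second shuffle: vpermilps uses the low 2 bits of each control dword. */
    __m128 packed = _mm_permutevar_ps(_mm_castsi128_ps(jv), ctrl);

    _mm_storeu_si128((__m128i *)out, _mm_castps_si128(packed));
    return __builtin_popcount((unsigned)m);
}
```

The trade-off is the one described above: a smaller table that is friendlier to the cache, in exchange for an extra shuffle on every lookup.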


Of course if you had AVX-512 (with AVX-512VL for 128-bit and 256-bit vectors) you could do a double compare into a mask register and use the resulting __mmask8 with vpcompressd to pack 32-bit elements, but before AVX-512 you don't have this primitive operation in hardware and it's expensive to emulate.
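For comparison, a sketch of the AVX-512 version, assuming a CPU with AVX-512F and AVX-512VL. The function name pack_left_lt_avx512 is hypothetical, and the target attribute and __builtin_popcount assume GCC or clang. Here the compare writes a mask register directly and vpcompressd does the packing, so no table is needed.

```c
#include <immintrin.h>
#include <stdint.h>

/* Keep j[i] where rij[i] < sij, packed to the low lanes of out;
   unselected lanes are zeroed. Returns the number of elements kept.
   Requires AVX-512F + AVX-512VL at run time. */
__attribute__((target("avx512f,avx512vl")))
int pack_left_lt_avx512(const int32_t j[4], const double rij[4],
                        double sij, int32_t out[4])
{
    __m128i jv = _mm_loadu_si128((const __m128i *)j);
    __m256d r  = _mm256_loadu_pd(rij);
    __m256d s  = _mm256_set1_pd(sij);

    /* Compare into a mask register; only the low 4 bits are meaningful. */
    __mmask8 k = _mm256_cmp_pd_mask(r, s, _CMP_LT_OQ);

    /* vpcompressd: move the selected 32-bit elements to the low lanes. */
    __m128i packed = _mm_maskz_compress_epi32(k, jv);

    _mm_storeu_si128((__m128i *)out, packed);
    return __builtin_popcount((unsigned)k);
}
```

Callers would normally check for AVX-512 support at run time (or build only for targets that have it) before taking this path.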

Peter Cordes