I was studying the expand and compress operations from the Intel intrinsics guide. I'm confused about these two concepts:

For __m128d _mm_mask_expand_pd (__m128d src, __mmask8 k, __m128d a) == vexpandpd

Load contiguous active double-precision (64-bit) floating-point elements from a (those with their respective bit set in mask k), and store the results in dst using writemask k (elements are copied from src when the corresponding mask bit is not set).

For __m128d _mm_mask_compress_pd (__m128d src, __mmask8 k, __m128d a) == vcompresspd

Contiguously store the active double-precision (64-bit) floating-point elements in a (those with their respective bit set in writemask k) to dst, and pass through the remaining elements from src.

Is there any clearer description or anyone who can explain more?

Alexis Wilke
Amiri
  • 6
    They're the vector equivalent of BMI2 [`pdep`](http://felixcloutier.com/x86/PDEP.html) and `pext`. See my AVX512 answer on [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) for how to use them. (And for how to use `pext` / `pdep` to generate shuffle masks on the fly for AVX2.) – Peter Cordes Jul 09 '18 at 10:27
  • 3
    Note that Peter Cordes' left packing answer shows how to emulate instruction `vcompressps`. On the other hand, the function `_mm256_mask_expand_epi32_AVX2_BMI` in my answer [here](https://stackoverflow.com/a/48178813) emulates the `vpexpandd` instruction. – wim Jul 09 '18 at 15:50
  • 1
    @PeterCordes - I _wish_ they were the vectorized versions of `pdep` and `pext`! The main difference is that `pdep` and `pext` are bit-wise, but even in future versions of AVX-512, the compress instructions are at their finest only byte-wise. I doubt we will see bit-wise versions any time soon: the hardware to do compression at bit granularity is considerable and doesn't overlap much with the other shuffle units. – BeeOnRope Jul 09 '18 at 19:24

1 Answer

These instructions implement the APL operators \ (expand) and / (compress). Expand takes a bit mask α of some m ≥ n bits, of which n are set, and an array ω of n numbers, and returns a vector of m numbers with the numbers from ω inserted into the places indicated by α and the rest set to zero. For example,

0 1 1 0 1 0 \ 2 3 4

returns

0 2 3 0 4 0

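The _mm_mask_expand_pd intrinsic implements this operator for a fixed m equal to the number of elements in the vector: m = 2 for a __m128d (only the low two bits of the __mmask8 are used), and m = 8 for the 512-bit _mm512_mask_expand_pd. One difference from the APL operator: with the mask variant, positions whose mask bit is clear are copied from src rather than zeroed; the _mm_maskz_expand_pd variant zeroes them.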
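To make the semantics concrete, here is a minimal scalar sketch of a masked expand. `expand_pd_model` is a hypothetical helper written for illustration, not an Intel API, and it assumes mask bit i corresponds to element i (so the leftmost number in the APL example is element 0). Calling it with n = 2 models the 128-bit intrinsic.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked expand over n lanes (n <= 8).
   Hypothetical helper for illustration only, not part of any Intel header. */
static void expand_pd_model(double *dst, const double *src, uint8_t k,
                            const double *a, size_t n)
{
    size_t next = 0;                  /* next unread low element of a */
    for (size_t i = 0; i < n; i++) {
        if ((k >> i) & 1u)
            dst[i] = a[next++];       /* mask bit set: take the next packed element */
        else
            dst[i] = src[i];          /* mask bit clear: pass src through */
    }
}
```

With `k = 0x16` (bits 1, 2 and 4 set), `a` starting with 2, 3, 4, an all-zero `src` and n = 6, this gives 0 2 3 0 4 0, matching the APL example above.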

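The compress operation undoes the effect of the expand operation: it uses a bit mask α to select entries from ω and packs these entries contiguously into the low elements of the destination, with the remaining upper elements taken from src. The compressstoreu variants (e.g. _mm_mask_compressstoreu_pd) store the packed entries contiguously to memory instead.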
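The compress direction can be sketched the same way. Again, `compress_pd_model` is a hypothetical scalar helper for illustration, not an Intel API, assuming mask bit i corresponds to element i; n = 2 models the register form of _mm_mask_compress_pd.

```c
#include <stddef.h>
#include <stdint.h>

/* Scalar model of a masked compress over n lanes (n <= 8).
   Hypothetical helper for illustration only, not part of any Intel header. */
static void compress_pd_model(double *dst, const double *src, uint8_t k,
                              const double *a, size_t n)
{
    size_t next = 0;                  /* next free low slot in dst */
    for (size_t i = 0; i < n; i++) {
        if ((k >> i) & 1u)
            dst[next++] = a[i];       /* selected element: pack it to the left */
    }
    for (size_t i = next; i < n; i++)
        dst[i] = src[i];              /* remaining upper slots come from src */
}
```

Feeding it the expanded vector 0 2 3 0 4 0 with the same mask `0x16` packs 2 3 4 into the low three slots, and the upper three slots are whatever `src` holds at those positions.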

fuz