Emulate AVX512 VPCOMPRESSB byte packing without AVX512_VBMI2

Question

I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows will contain data.

I am looking for an instruction (or instructions) to do the same thing with a byte array in a zmm register as VPCOMPRESSQ does with a qword array in a zmm register.

Consider a situation where rows 3, 6, 7, 9 and 12 contain data and all the other rows (a total of 64 rows or fewer) are empty. Mask register k1 now contains 64 bits set to 001001101001000 ... 0.

With the VPCOMPRESSQ instruction, I can use k1 to compress the zmm register so that non-zero elements are arranged contiguously from the start of the register, but there is no equivalent instruction for bytes.

PSHUFB looks like a candidate, but its shuffle control must be a vector of byte indices, not a bitmask. I can get an integer array 0 0 3 0 0 6, etc., but that still has elements set to zero, and I need the non-zero elements arranged contiguously from the start of a zmm register.

So my question is, given a zmm register with a byte array where k1 is set as shown above, how can I get the non-zero elements arranged contiguously from the beginning of the zmm register, like VPCOMPRESSQ does for qwords?

AVX512VBMI2 has VPCOMPRESSB, but that's only available in Ice Lake and later. What can be done on Skylake-avx512 / Cascade Lake?
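
To pin down the operation being asked for, here is a minimal scalar sketch in C of the setup and the desired result. The function names `nonzero_mask` and `compress_bytes_ref` are hypothetical, and bit i of the mask is assumed to correspond to byte i of the register, lowest byte first. It models a zero-masking compress: selected bytes are packed to the front and the rest of the output is zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Build the k1-style mask: bit i is set when v[i] is non-zero.
 * (On AVX512BW hardware, VPTESTMB of the register with itself
 * produces this kind of mask.) */
uint64_t nonzero_mask(const uint8_t v[64])
{
    assert(v != NULL);
    uint64_t k = 0;
    for (unsigned i = 0; i < 64; i++)
        k |= (uint64_t)(v[i] != 0) << i;
    return k;
}

/* Reference model of a zero-masking byte compress: bytes whose mask
 * bit is set are packed to the front of out in their original order,
 * and the remaining bytes of out are zero. Returns the packed count. */
size_t compress_bytes_ref(const uint8_t v[64], uint64_t k, uint8_t out[64])
{
    assert(v != NULL && out != NULL && v != out);
    size_t n = 0;
    memset(out, 0, 64);
    for (unsigned i = 0; i < 64; i++)
        if ((k >> i) & 1)
            out[n++] = v[i];
    return n;
}
```

A vectorized replacement can be checked against this model for any input and mask.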

You could expand 1 vector of bytes to 4 vectors of dwords so you can use `vpcompressd` (and right-shift the mask each time). Then you only have to handle the 3 boundaries between 16-element chunks when recombining. (Presumably repack down to 1-byte chunks for that, or maybe 2-byte for `vpermt2w`.) Or even store/reload if the extra latency is ok from letting the store buffer recombine multiple overlapping stores for one vector reload. — Peter Cordes, May 10 '20 at 19:38
When you say AVX, you mean AVX512, right? You're still talking about actually using the `k1` mask register. Because if you had to do this without AVX512, see [AVX2 what is the most efficient way to pack left based on a mask?](https://stackoverflow.com/q/36932240) for basically emulating `vpcompressd` for dword elements with AVX2 + BMI2 `pext` / `pdep`. Although maybe an entirely different strategy would make more sense for bytes. — Peter Cordes, May 10 '20 at 19:43
Yes, AVX512. Your first comment sounds expensive, but I expect to find that most of the time not more than 16 rows will be filled, in which case it could work, using the 17th and later only when needed. I'll see your links about not using AVX512. — RTC222, May 10 '20 at 19:46
The hardware left-packing options are dword with AVX512F or bit in a 64-bit register with BMI2, or looking up a shuffle-control vector from a table of masks. Unfortunately for your case, the last option would be a 64-bit index into a table of 2^64 x 64-byte control vectors for `vpermb` (AVX512VBMI: Ice Lake), or splitting that up somehow into multiple `pshufb` lookups for small chunks like 8 elements at most. But vpcompressd can do 16 elements at once efficiently so use that. Yes, it's expensive to left-pack 64 elements. — Peter Cordes, May 10 '20 at 19:58
I just learned that there is now a VPCOMPRESSB instruction, which does not show up at https://www.felixcloutier.com/x86/ so it must be fairly new, but it's in the Oct 2019 release of the Intel combined volumes set. — RTC222, May 10 '20 at 20:04
Oh yeah, https://github.com/HJLebbink/asm-dude/wiki/VPCOMPRESS says it's part of AVX512_VBMI2. I'd forgotten or never noticed that. So yes, if you have an Ice Lake CPU, you're all set! https://en.wikipedia.org/wiki/AVX-512#CPUs_with_AVX-512. According to uops.info, it's 2 uops for port 5 / 2c throughput for the ZMM version: https://www.uops.info/table.html?search=vpcompressb&cb_lat=on&cb_tp=on&cb_uops=on&cb_ports=on&cb_SKX=on&cb_ICL=on&cb_ZEN2=on&cb_measurements=on&cb_base=on&cb_avx512=on. Amusingly they tested it both with and without masking. Without masking it's just a copy :P — Peter Cordes, May 10 '20 at 20:19
See, that's the problem with a lot of AVX-512 instructions -- coding for older hardware. For that the best solution I can see now is your first suggestion of using the 32-bit VPCOMPRESS instruction. That would work. At least Intel now has the VPCOMPRESSB instruction I need, even if it's not available on most existing chips. — RTC222, May 10 '20 at 20:23
That's the same problem as always, just the naming has changed. Like when AVX2 was brand-new, you'd still want an AVX1 or SSE4.2 version of something that could run on existing CPUs. At least `vpcompressd` provides the primitive operation you want. — Peter Cordes, May 10 '20 at 20:26
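
The `vpcompressd` approach from the comments could look roughly like the following C intrinsics sketch. It is one possible reading of that suggestion, not a tuned implementation: it takes the store/reload route for recombining the four chunks, works on memory buffers rather than a live zmm register, assumes GCC or Clang on x86-64, and falls back to a scalar loop when AVX512F is not available at run time. `compress_bytes` and `compress_bytes_avx512f` are hypothetical names.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#if defined(__x86_64__) && (defined(__GNUC__) || defined(__clang__))
#include <immintrin.h>
#define HAVE_X86_PATH 1

/* AVX512F-only path: widen each 16-byte chunk to dwords, left-pack it
 * with vpcompressd, narrow back to bytes, and store at the running
 * output offset so the chunks end up contiguous. */
__attribute__((target("avx512f")))
static size_t compress_bytes_avx512f(const uint8_t *in, uint64_t mask,
                                     uint8_t *out)
{
    size_t n = 0;
    memset(out, 0, 64);                       /* zero the final tail */
    for (int chunk = 0; chunk < 4; chunk++) {
        /* Right-shift the mask to get this chunk's 16 bits. */
        __mmask16 k = (__mmask16)(mask >> (16 * chunk));
        __m128i bytes  = _mm_loadu_si128((const __m128i *)(in + 16 * chunk));
        __m512i dwords = _mm512_cvtepu8_epi32(bytes);             /* vpmovzxbd   */
        __m512i packed = _mm512_maskz_compress_epi32(k, dwords);  /* vpcompressd */
        __m128i narrow = _mm512_cvtepi32_epi8(packed);            /* vpmovdb     */
        /* n <= 16 * chunk, so this 16-byte store never passes out + 64. */
        _mm_storeu_si128((__m128i *)(out + n), narrow);
        n += (size_t)__builtin_popcount((unsigned)k);
    }
    return n;
}
#endif

/* Packs the bytes of in selected by mask to the front of out (64 bytes),
 * zero-fills the rest, and returns the number of bytes packed. */
size_t compress_bytes(const uint8_t *in, uint64_t mask, uint8_t *out)
{
    assert(in != NULL && out != NULL && in != out);
#ifdef HAVE_X86_PATH
    if (__builtin_cpu_supports("avx512f"))
        return compress_bytes_avx512f(in, mask, out);
#endif
    /* Portable fallback with the same result. */
    size_t n = 0;
    memset(out, 0, 64);
    for (unsigned i = 0; i < 64; i++)
        if ((mask >> i) & 1)
            out[n++] = in[i];
    return n;
}
```

Each 16-byte store overwrites the zero tail left by the previous one, which is what makes the chunk boundaries disappear. A caller that wants the result back in a register can reload `out` with `_mm512_loadu_si512`, at the cost of the extra latency mentioned in the comments. When at most 16 rows are usually filled, as suggested above, the later chunks could be skipped whenever their 16 mask bits are all zero.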

