I have populated a zmm register with an array of byte integers from 0-63. The numbers serve as indices into a matrix. Non-zero elements represent rows in the matrix that contain data. Not all rows will contain data.
I am looking for an instruction (or instructions) to do the same thing with a byte array in a zmm register as VPCOMPRESSQ does with a qword array in a zmm register.
Consider a situation where rows 3, 6, 7, 9 and 12 contain data and all the other rows (a total of 64 rows or fewer) are empty. Mask register k1 then holds 64 bits with the pattern 001001101001000 ... 0.
With the VPCOMPRESSQ instruction, I can use k1 to compress the zmm register so that non-zero elements are arranged contiguously from the start of the register, but there is no equivalent instruction for bytes.
PSHUFB looks like a candidate, but its shuffle control must be a vector of byte indices, not a bit mask. I can get an integer array 0 0 3 0 0 6, etc., but that still has elements set to zero in between, and I need the non-zero elements arranged contiguously from the start of a zmm register.
So my question is, given a zmm register with a byte array where k1 is set as shown above, how can I get the non-zero elements arranged contiguously from the beginning of the zmm register, like VPCOMPRESSQ does for qwords?
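To make the desired result concrete, here is a scalar C sketch of the behavior I am after. It is only a reference model, not a vectorized solution. The function name compress_bytes_ref is made up for illustration, and it assumes that bit i of the mask corresponds to byte lane i and that the unused tail of the destination is zeroed.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical scalar model of a 64-lane byte compress (zeroing form).
   Assumption: bit i of mask selects src[i]. Selected bytes are packed into
   dst[0..n-1] in ascending lane order and the rest of dst is set to zero.
   Returns n, the number of bytes kept. */
static size_t compress_bytes_ref(uint8_t dst[64], const uint8_t src[64],
                                 uint64_t mask)
{
    assert(dst != NULL && src != NULL);

    size_t n = 0;
    for (size_t i = 0; i < 64; i++) {
        if ((mask >> i) & 1u) {
            dst[n++] = src[i];   /* keep this lane, append to packed output */
        }
    }
    for (size_t i = n; i < 64; i++) {
        dst[i] = 0;              /* zero the unused tail */
    }
    return n;
}
```

With src[i] = i and mask bits 3, 6, 7, 9 and 12 set, this leaves 3, 6, 7, 9, 12 in the first five bytes of dst and zeros everywhere else. That is the layout I want in the zmm register.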
AVX512VBMI2 has VPCOMPRESSB, but that's only available in Ice Lake and later. What can be done on Skylake-avx512 / Cascade Lake?