
One of the benefits of Intel's AVX-512 extension is that nearly all operations can be masked: in addition to the vector registers, you provide a kreg specifying a mask to apply to the operation, and elements excluded by the mask are either set to zero or retain their previous value.

A particularly common use of the kreg is to build a mask that excludes N contiguous elements at the beginning or end of a vector, e.g., for the first or final iteration of a vectorized loop where fewer than a full vector's worth of elements needs to be processed. For example, in a loop over 121 int32_t values, the first 112 elements can be handled by 7 full 512-bit vectors, but that leaves 9 elements to be handled by masked operations which operate only on the first 9 elements.
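For concreteness, here is a minimal sketch of that 121-element case (an added illustration, not code from the original question), using a zero-masking load for the 9-element tail. Since the length is a compile-time constant here, the tail mask is simply the constant 0x01FF (low 9 bits set):

#include <immintrin.h>
#include <stdint.h>

/* Sum 121 int32_t values: 7 full 512-bit vectors, then one masked tail
   covering elements 112..120.  Masked-off lanes are neither loaded nor
   added. */
static int sum121(const int32_t *p)
{
    __m512i acc = _mm512_setzero_si512();
    for (int i = 0; i < 112; i += 16)                    /* 7 full vectors */
        acc = _mm512_add_epi32(acc, _mm512_loadu_si512(p + i));

    __mmask16 k = 0x01FF;                                /* low 9 bits set */
    acc = _mm512_add_epi32(acc, _mm512_maskz_loadu_epi32(k, p + 112));

    return _mm512_reduce_add_epi32(acc);                 /* horizontal sum */
}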

So the question is: given a runtime-valued integer r in the range 0-16 representing the number of remaining elements, what's the most efficient way to load a 16-bit kreg such that the low r bits are set and the remaining bits are clear? KSHIFTLW seems unsuitable for the purpose because it only takes an immediate shift count.

BeeOnRope
  • I assume/hope `(~0) >> (16-r)` compiles efficiently to a 32-bit MOV-immediate + SHRX + KMOVW. Err, I guess there's a sub or neg in there too, and x86 shifts only mask their count by `&31`, not `&15`, even for 16-bit shifts (rough C sketch of this after the comments). Perhaps a better sequence on Intel CPUs is `xor`-zero / BTS / DEC / KMOVW – Peter Cordes Nov 19 '19 at 08:45
  • Possible duplicate of [this question](https://stackoverflow.com/q/54809132/2439725). – wim Nov 20 '19 at 13:53
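
A rough C equivalent of the shift idea from the first comment above (an illustrative sketch added here, not code posted in the comments); a compiler would be expected to turn this into roughly a mov-immediate, a neg or sub, SHRX, and KMOVW:

#include <immintrin.h>

/* Shift-based mask construction: for r in 0..16, shift a 16-bit all-ones
   pattern right by (16 - r).  Done on a 32-bit value so the r = 0 case
   (shift count 16) is well defined in C. */
static __mmask16 mask_low_r_shift(unsigned r)
{
    return (__mmask16)(0xFFFFu >> (16 - r));
}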

1 Answer


BMI2 bzhi does exactly what you want: Zero High Bits Starting with Specified Bit Position. Every CPU with AVX512 so far has BMI2.

__mmask16 k = _bzhi_u32(-1UL, r);

This costs 2 instructions, both single-uop: a mov-immediate and bzhi. It's even single-cycle latency (or 3 cycles on KNL).

  • For r=0, it zeros all the bits, giving 0.
  • For r=1, it leaves only the low bit (bit #0), giving 1.
  • For r=12, it zeros bit #12 and higher, leaving 0x0FFF (12 bits set).
  • For r>=32, BZHI leaves all 32 bits set (and sets CF).

Per Intel's documentation for BZHI: "The INDEX is specified by bits 7:0 of the second source operand."
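
A usage sketch (added here; the zero-masking load is just an assumed example operation, not from the answer):

#include <immintrin.h>
#include <stdint.h>

/* Load the first r int32_t elements of p, for r in 0..16; lanes r..15 come
   out zeroed, and masked-off lanes are not read from memory. */
static __m512i load_first_r(const int32_t *p, unsigned r)
{
    __mmask16 k = _bzhi_u32(-1U, r);     /* low r bits set */
    return _mm512_maskz_loadu_epi32(k, p);
}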

If you had a single-vector-at-a-time cleanup loop that runs after an unrolled vector loop, you could even use this every loop iteration, counting the remaining length down towards zero instead of doing a separate last-vector cleanup (see the sketch below). It leaves all bits set for large remaining lengths. But this costs 2 uops inside the loop, including a port-5 kmovw, and means your main loop would have to use masked instructions. It also only works for r<=255 because bzhi only looks at the low byte, not the full integer index. But the `mov reg, -1` can be hoisted out of the loop because bzhi doesn't destroy its source.
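
A sketch of that pattern (an added illustration of the idea, assuming a simple copy as the per-vector operation and a total length <= 255 per the caveat above; the names are not from the answer):

#include <immintrin.h>
#include <stdint.h>

/* Single-vector-at-a-time loop that masks every iteration instead of having
   a separate cleanup.  Valid only for len <= 255 because bzhi looks at just
   the low byte of its index. */
static void copy_masked(int32_t *dst, const int32_t *src, unsigned len)
{
    for (unsigned remaining = len; remaining > 0; )
    {
        /* remaining >= 16 sets all 16 mask bits; otherwise the low
           `remaining` bits.  The all-ones source operand can live in a
           register hoisted out of the loop, since BZHI doesn't modify it. */
        __mmask16 k = _bzhi_u32(-1U, remaining);
        __m512i v = _mm512_maskz_loadu_epi32(k, src);
        _mm512_mask_storeu_epi32(dst, k, v);
        src += 16;
        dst += 16;
        remaining = remaining > 16 ? remaining - 16 : 0;
    }
}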


PS. Normally I think you'd want to arrange your cleanup to handle 1..16 elements (or 0..15 if you branch to maybe skip it entirely). But the full 17-possibility range 0..16 makes sense if this cleanup also handles small lengths that never enter the main loop at all, and len=0 is possible. (And if your main loop exits with a remaining length of 1..16, the final iteration can be unconditional.)

Peter Cordes