One of the benefits of Intel's AVX-512 extension is that nearly all operations can be masked by providing in addition to the vector register a kreg which specifies a mask to apply to the operation: elements excluded by the mask may either be set to zero or retain their previous value.
A particularly common use of the kreg is to create a mask that excludes N contiguous elements at the beginning or end of a vector, e.g., as the first or final iteration in a vectorized loop where less than a full vector would be processed. E.g., for a loop over 121 int32_t
values, the first 112 elements could be handled by 7 full 512-bit vectors, but that leaves 9 elements left over which could be handled by masked operations which operate only on the first 9 elements.
So the question is, given a (runtime valued) integer r
which is some value in the range 0 - 16 representing remaining elements, what's the most efficient way to load a 16-bit kreg such that the low r
bits are set and the remaining bits unset? KSHIFTLW
seems unsuitable for the purpose because it only takes an immediate.