Fastest way to produce a mask with n ones starting at position i

Question

What is the fastest way (in terms of cpu cycles on common modern architecture), to produce a mask with len bits set to 1 starting at position pos:

template <class UIntType>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
    // Body of the function
}

// Call of the function
auto mask = make_mask<uint32_t>(4, 10);
// mask = 00000000 00000000 00111111 11110000 
// (in binary with MSB on the left and LSB on the right)

Plus, is there any compiler intrinsics or BMI function that can help?

Does this have to cover the case where `len` is same as the number of bits of type? That adds extra complications — harold, Sep 04 '16 at 21:23
It would be better if the function was working for `len` >= number of bits of `UIntType` (in this case all bits after `pos` are set to `1`) — Vincent, Sep 04 '16 at 21:32
if len is equal to or greater than the number of bits of that type then if you use the (1< — old_timer, Sep 04 '16 at 21:54
@dwelch: if `int` is 32 bit, `1U<<32` is undefined, not 0. See, for example, http://coliru.stacked-crooked.com/a/ae28ba4070188ace . So if you supply `len` as 32, with this particular expression of UB, `(1< — rici, Sep 04 '16 at 22:47
unsigned int fun ( void ) { return((1<<32)-1); } returns 0xFFFFFFFF as I described. Using g++ that is, there is a warning not an error though. YMMV Your example neglected to subtract one yes? So if you got all zeros then you also proved my point. — old_timer, Sep 05 '16 at 00:01
@dwelch, no I got 1. You can see the output in the link. Since left shifting an int by 32 is UB, there is no guarantee of consistency; your example is evaluated by the compiler while mine is evaluated at runtime, and they have different results. — rici, Sep 05 '16 at 05:07
@rici: related: [a rotate idiom that avoids UB](http://stackoverflow.com/questions/776508/best-practices-for-circular-shift-rotate-operations-in-c). `val << count` on most hardware either saturates the count or takes it modulo the operand width (i.e. looks at only the low 5 bits for a 32-bit shift). This is why you get 0: `1<<32` at runtime is 1 in hardware on x86. Compile-time behaviour may differ, because this is pretty much bogus undefined behaviour. — Peter Cordes, Sep 05 '16 at 10:48
@Vincent: Can you narrow this question down to just what compiles well on x86, or ARM? IIRC PowerPC can do this in one instruction (given an all-ones to start with). Really though, it's not too broad to consider such a simple problem for a few of the major architectures, so I disagree with closing it as too broad. It might be useful to say whether latency or throughput is more important, since cycle-counting on modern out-of-order CPUs is more complicated than that. There are three main factors for pure ALU static analysis: total uop/insn count (frontend impact), latency, and throughput. — Peter Cordes, Sep 05 '16 at 10:53
@PeterCordes: Yeah I knew all that :) It's not "pretty much bogus UB". It's UB and therefore unusable; on at least one popular implementation (gcc on x86), it will differ between runtime and compile-time. The rotate idiom depends on the fact that there are 32 possible rotations for a 32-bit value; however, there are 33 possible shifts so a shift operator whose right-hand domain has size 32 is inadequate. You could do a safe left shift of n by i for i up to 2*(bits(n)-1): `n<<(i>>1)<<(-~i>>1)`. That's got its charm, I suppose. (`-~i` is `i+1` avoiding parentheses). — rici, Sep 06 '16 at 21:35
@rici: right yeah, I probably should have pinged @ dwelch with that comment, since he was the one suggesting 1<<32. I was already working on an answer to this question that points out that the most important criterion is whether len is [0..31], [1..32], or [0..32] (or needs bounds checking), and gives a different optimal implementation for those cases. ( `(-1LL >> (32-len)) << pos` being the other one. It compiles to slightly more/slower asm than the KIIV's answer.) — Peter Cordes, Sep 06 '16 at 21:56
related: duplicates: [What is the best practice way to create a bitmask for a range of bits?](https://stackoverflow.com/q/21465833/995714), [Masking bits within a range given in parameter in C](https://stackoverflow.com/q/28035794/995714), [uint64_t setting a range of bits to 1](https://stackoverflow.com/q/48436659/995714), [Obtain a specific subset of bits of an int in Java](https://stackoverflow.com/q/16001819/995714), [set the m-bit to n-bit](https://stackoverflow.com/q/15917454/995714) — phuclv, Jun 09 '19 at 02:10

KIIV · Answer 1 · 2016-09-04T21:37:34.770

7

Fastest way? I'd use something like this:

template <class T>
constexpr T make_mask(std::size_t pos, std::size_t len)
{
  return ((static_cast<T>(1) << len)-1) << pos;
}

edited Sep 04 '16 at 21:37

answered Sep 04 '16 at 21:33

KIIV

3,534
2
18
23

2

Minor suggestion: changing `(T)1` to `static_cast(1)` will make the parentheses a little less LISP-like, and might be easier to read. (Personally, I'd use the static_cast, but the C-style cast is okay here, too). – Pete Becker Sep 04 '16 at 21:36
I put this on the [Godbolt compiler explorer](https://godbolt.org/g/njWxBD). This answer compiles to really efficient code; looks better than starting with `-1LL` and shifting that. BMI2's `shlx` instruction makes this really efficient (since it's much faster than regular `shl` on Intel Haswell/BDW/Skylake, although even regular variable-count `shl r32, cl` is only 2 cycle latency (and 3 uops)) (See http://stackoverflow.com/tags/x86/info, and http://agner.org/optimize/) – Peter Cordes Sep 05 '16 at 11:08
If `len` can be 32 (or whatever the type width) but not 0, then you should start with `static_cast(-1LL)` and right+left shift that. If `len` can be 0 but not 32, this answer is ideal. If `len` can be 0 or 32, and you need both to work, you need something fancier than either of those solutions. :( Jean-Baptiste's lookup-table could work, if you're sure `len` doesn't need a range-check. (It needs LUT 33 entries, from 0 to 0xFFFFFFFF) – Peter Cordes Sep 05 '16 at 11:17

rici · Accepted Answer · 2016-09-06T21:43:04.620

If by "starting at pos", you mean that the lowest-order bit of the mask is at the position corresponding with 2^pos (as in your example):

((UIntType(1) << len) - UIntType(1)) << pos

If it is possible that len is ≥ the number of bits in UIntType, avoid Undefined Behaviour with a test:

(((len < std::numeric_limits<UIntType>::digits)
     ? UIntType(1)<<len
     : 0) - UIntType(1)) << pos

(If it is also possible that pos is ≥ std::numeric_limits<UIntType>::digits, you'll need another ternary op test.)

You could also use:

(UIntType(1)<<(len>>1)<<((len+1)>>1) - UIntType(1)) << pos

which avoids the ternary op at the cost of three extra shift operators; I doubt whether it would be faster but careful benchmarking would be necessary to know for sure.

Jean-Baptiste Yunès · Answer 3 · 2016-09-04T22:21:51.380

4

Maybe using a table? For type uint32_t you can write:

static uint32_t masks[] = { 0x0, 0x1, 0x3, 0x7, 0xf, 0x1f, 0x3f...}; // only 32 such masks
return masks[len] << pos;

Whatever is the int type the number of masks is not so huge and the table can be easily generated by templates.

For BMI, maybe using BZHI? Starting from all bits set, BZHI with value 32-len and then shift by pos.

edited Sep 04 '16 at 22:21

answered Sep 04 '16 at 22:01

Jean-Baptiste Yunès

34,548
4
48
69

+1. BZHI is likely to be fastest, as you don't need a memory access to your table, but assuming this is happening in a tight loop (and if it isn't, why optimize it?) the table access is probably almost as good. – Periata Breatta Sep 05 '16 at 02:30
Even a full bidimensional table with (pos, len) indexes is thinkable, 64²=4096 entries (of which half are useless, unless you want to play with triangular indexing). – Sep 05 '16 at 07:53
@YvesDaoust: that sounds like a terrible idea. It would be unlikely to stay hot in L1. Even doing part of it with a table lookup sounds dicey unless your code has so much instruction-level parallelism that fewer uops at the cost of more latency is valuable. L1 load-use latency is ~4 cycles on recent Intel CPUs, but I think KIIV's function could have the `(1< – Peter Cordes Sep 05 '16 at 08:12
@PeterCordes: I know, but the question is about fixed len/pos values so you can expect that the same value is reused ever and ever. Anyway all this discussion is unnecessary, see my answer. – Sep 05 '16 at 08:17
@YvesDaoust: yeah, I saw your answer and upvoted it. I'm assuming the other answers are trying to be useful for the case where len/pos aren't compile-time constants, assuming that the OP just used constants to simplify the example. – Peter Cordes Sep 05 '16 at 08:26
@PeterCordes: of course. But the mask definition is explicitly constexpr. – Sep 05 '16 at 08:44
@YvesDaoust: That means it *can* be used where compile-time constants are required. It can also be used in other cases too. It's just generally a good idea to put on functions like this that have no side effects, right? I think your answer is useful in case the OP didn't realize that, but I also think the other answers are useful for the general case (esp. to future readers). – Peter Cordes Sep 05 '16 at 08:57
This actually needs 33 entries, from 0 to 0xFFFFFFFF, to handle all the possibilities from len=0 to len=32. See my comments on KIIV's answer. – Peter Cordes Sep 05 '16 at 11:18

score 3 · Answer 4 · 2016-09-05T08:19:21.030

3

Speed is irrelevant here as the expression is constant, hence precomputed by the optimizer and in all likelyhood used as an immediate operand. Whatever you use, it will cost you 0 cycle.

edited Sep 05 '16 at 08:19

answered Sep 05 '16 at 08:05

Peter Cordes · Answer 5 · 2016-09-19T18:05:36.227

The biggest issue here is the range of possible inputs. In C, shifts with a count larger than the type width are Undefined Behaviour. However, it looks like len can meaningfully range from 0 to the type width. e.g. 33 different lengths for uint32_t. With pos=0, we get masks from 0 to 0xFFFFFFFF. (I'm just going to assume 32-bit in English and asm for clarity, but use generic C++).

If we can exclude either end of that range as possible inputs, then there are only 32 possible lengths, and we can use a left or right shift as a building block. (Use an assert() to verify the input range in debug builds.)

I put several versions (from other answers) of the function on the Godbolt compiler explorer with some macros to compile them with constant len, constant pos, or both inputs variable. Some do better than others. KIIV's looks good for the range it's valid for (len=0..31, pos=0..31).

This version works for len=1..32, and pos=0..31. It generates slightly worse x86-64 asm than KIIV's, so use KIIV's if it works without extra checks.

// right-shift a register of all-ones, then shift it into position.
// works for len=1..32 and pos=0..31
template <class T>
constexpr T make_mask_PJC(std::size_t pos, std::size_t len)
{
//  T all_ones = -1LL;
//  unsigned typebits = sizeof(T)*CHAR_BIT;  // std::numeric_limits<T>::digits
//  T len_ones = all_ones >> (typebits - len);
//  return len_ones << pos

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

// Same idea, but mask the shift count the same way x86 shift instructions do, so the compiler can do it for free.
// Doesn't always compile to ideal code with SHRX (BMI2), maybe gcc only knows about letting the shift instruction do the masking for the older SHR / SHL instructions
uint32_t make_mask_PJC_noUB(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  T len_ones = all_ones >> ( (typebits - len) & (typebits-1));     // the AND optimizes away
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

If len can be anything in [0..32], I don't have any great ideas for efficient branchless code. Perhaps branching is the way to go.

uint32_t make_mask_fullrange(std::size_t pos, std::size_t len)
{
  using T=uint32_t;

  static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");

  T all_ones = -1LL;
  unsigned typebits = std::numeric_limits<T>::digits;
  //T len_ones = all_ones >> ( (typebits - len) & (typebits-1));
  T len_ones = len==0 ? 0 : all_ones >> ( (typebits - len) & (typebits-1));
  return len_ones << (pos & (typebits-1));

//  return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
}

Fastest way to produce a mask with n ones starting at position i

5 Answers5

Linked

Related