Creating a mask with N least significant bits set

Question

I would like to create a macro or function¹ mask(n) which, given a number n, returns an unsigned integer with its n least significant bits set. This seems like it should be a basic primitive with heavily discussed implementations that compile efficiently, but that doesn't seem to be the case.

Of course, various implementations may have different sizes for the primitive integral types like unsigned int, so let's assume for the sake of concreteness that we are talking about returning a uint64_t specifically, although of course an acceptable solution would work (with different definitions) for any unsigned integral type. In particular, the solution should be efficient when the type returned is equal to or smaller than the platform's native width.

Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1. Many "obvious" solutions don't work for one of these two cases.
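
Since the domain has only 65 values, any candidate can be checked exhaustively. A minimal sketch of such a check, assuming a slow bit-by-bit reference; the names mask_reference and mask_is_correct are made up for illustration and are not part of the question:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Reference: build the mask one bit at a time, so no shift count ever
   reaches 64. Slow, but free of undefined behavior. */
static uint64_t mask_reference(unsigned n)
{
    assert(n <= 64);
    uint64_t m = 0;
    for (unsigned i = 0; i < n; i++)
        m |= (uint64_t)1 << i;      /* i is at most 63 here */
    return m;
}

/* Returns true if candidate(n) matches the reference for every n in [0, 64]. */
static bool mask_is_correct(uint64_t (*candidate)(unsigned))
{
    for (unsigned n = 0; n <= 64; n++)
        if (candidate(n) != mask_reference(n))
            return false;
    return true;
}
```

Passing any of the implementations below to mask_is_correct exercises both edge cases, n == 0 and n == 64.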

The most important criterion is correctness: only correct solutions which don't rely on undefined behavior are interesting.

The second most important criterion is performance: the idiom should ideally compile to approximately the most efficient platform-specific way to do this on common platforms.

A solution that sacrifices simplicity in the name of performance, e.g., that uses different implementations on different platforms, is fine.

¹ The most general case is a function, but ideally it would also work as a macro, without evaluating any of its arguments more than once.

Davislor · Answer 1 · 2018-10-01T00:53:39.277

6

Try

unsigned long long mask(const unsigned n)
{
  assert(n <= 64);
  return (n == 64) ? 0xFFFFFFFFFFFFFFFFULL :
     (1ULL << n) - 1ULL;
}

There are several great, clever answers that avoid conditionals, but a modern compiler can generate code for this that doesn’t branch.

Your compiler can probably figure out to inline this, but you might be able to give it a hint with inline or, in C++, constexpr.

The unsigned long long int type is guaranteed to be at least 64 bits wide and present on every implementation, which uint64_t is not.

If you need a macro (because you need something that works as a compile-time constant), that might be:

#define mask(n) ((64U == (n)) ? 0xFFFFFFFFFFFFFFFFULL : (1ULL << (unsigned)(n)) - 1ULL)

As several people correctly reminded me in the comments, 1ULL << 64U is potential undefined behavior! So, insert a check for that special case.

You could replace 64U with CHAR_BIT*sizeof(unsigned long long) if it is important to you to support the full range of that type on an implementation where it is wider than 64 bits.
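
A sketch of that generalization, assuming the type has no padding bits (so that CHAR_BIT * sizeof gives its width); ULL_BITS and mask_full are names made up here:

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical helper: width of unsigned long long, assuming no padding bits. */
#define ULL_BITS (CHAR_BIT * sizeof(unsigned long long))

static inline unsigned long long mask_full(unsigned n)
{
    assert(n <= ULL_BITS);
    /* The full-width case is tested first, so the shift count stays below the width. */
    return (n == ULL_BITS) ? ULLONG_MAX : (1ULL << n) - 1ULL;
}
```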

You could similarly generate this from an unsigned right shift, for example ~0ULL >> (64 - n), but you would then need to check n == 0 as a special case, since right-shifting by the width of the type is undefined behavior.
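
A sketch of the right-shift form, with the name mask_shr assumed for illustration. The shift count here is 64 - n, so the input that would shift by the full width is n == 0:

```c
#include <assert.h>
#include <stdint.h>

/* Start from all ones and shift out the 64 - n high bits.
   n == 0 would need a shift by 64, which is undefined, so it is tested first. */
static inline uint64_t mask_shr(unsigned n)
{
    assert(n <= 64);
    return (n == 0) ? 0 : UINT64_MAX >> (64 - n);
}
```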

ETA:

The relevant portion of the (N1570 Draft) standard says, of both left and right bit shifts:

If the value of the right operand is negative or is greater than or equal to the width of the promoted left operand, the behavior is undefined.

This tripped me up. Thanks again to everyone in the comments who reviewed my code and pointed the bug out to me.

edited Oct 01 '18 at 00:53

answered Sep 30 '18 at 03:45

Davislor


*Critically, this must work for all n in [0, 64]. In particular mask(0) == 0 and mask(64) == (uint64_t)-1*. – n. m. could be an AI Sep 30 '18 at 03:47
Unsigned math is guaranteed to wrap around in the Standard, so `0U - 1U` sets every bit. I'll review what it says about unsigned left shift overflowing. – Davislor Sep 30 '18 at 03:56
For now, I added a possibly-redundant test. – Davislor Sep 30 '18 at 03:56
IDK what it says about shifts but in practice `1ULL << 64` is usually 1, not 0 – harold Sep 30 '18 at 03:59
@Harold Good point. Added a special-case handler and an assertion. Thanks for the code review, guys! – Davislor Sep 30 '18 at 04:03
Similarly, a right shift doesn't normally let you shift out all bits, except on PowerPC and maybe some others – harold Sep 30 '18 at 04:25
Ugh. Yes, the standard says a right shift of the width of the type is UB. – Davislor Sep 30 '18 at 04:42
@harold Corrected. Thanks again! – Davislor Sep 30 '18 at 04:44
[What does the C standard say about bitshifting more bits than the width of type?](https://stackoverflow.com/q/11270492/995714) – phuclv Sep 30 '18 at 10:49
@phuclv Indeed, the question here is what it says about bitshifting *as many* bits as the width of the type. – Davislor Sep 30 '18 at 19:09
@Davislor: the *answer* on that question does say "greater than or equal to the width", which is why phuclv linked it. – Peter Cordes Oct 01 '18 at 00:32

phuclv · Answer 2 · 2020-04-24T05:11:38.810

Another solution without branching

unsigned long long mask(unsigned n)
{
    return ((1ULL << (n & 0x3F)) & -(n != 64)) - 1;
}

n & 0x3F keeps the shift amount at 63 or less in order to avoid UB. In fact, most modern architectures use only the low bits of the shift amount, so no AND instruction is needed for this.

The checking condition for 64 can be changed to -(n < 64) to make it return all ones for n ⩾ 64, which is equivalent to _bzhi_u64(-1ULL, (uint8_t)n) if your CPU supports BMI2.
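
A sketch of how the two could be combined, assuming GCC/Clang's __BMI2__ predefined macro as the feature test and the hypothetical name mask_saturating. The two paths agree for n in [0, 255]; beyond that the (uint8_t) cast wraps and they can differ:

```c
#include <assert.h>
#include <stdint.h>
#if defined(__BMI2__) && defined(__x86_64__)
#include <immintrin.h>
#endif

/* Saturating variant: low n bits for n < 64, all ones for 64 <= n <= 255. */
static inline uint64_t mask_saturating(unsigned n)
{
#if defined(__BMI2__) && defined(__x86_64__)
    /* BZHI clears the bits from position `index` upward and leaves the
       source unchanged when the index is 64 or more. */
    return _bzhi_u64(UINT64_MAX, (uint8_t)n);
#else
    /* Portable form with the -(n < 64) condition: the shifted 1 is masked
       to 0 for n >= 64, so the subtraction wraps to all ones. */
    return ((1ULL << (n & 0x3F)) & -(unsigned long long)(n < 64)) - 1;
#endif
}
```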

The output from Clang looks better than gcc's. As it happens, gcc emits conditional instructions for MIPS64 and ARM64 but not for x86-64, resulting in longer output.

The condition can also be simplified to n >> 6, utilizing the fact that it'll be one if n = 64. And we can subtract that from the result instead of creating a mask like above

return (1ULL << (n & 0x3F)) - (n == 64) - 1; // or n >= 64
return (1ULL << (n & 0x3F)) - (n >> 6) - 1;

gcc compiles the latter to

mov     eax, 1
shlx    rax, rax, rdi
shr     edi, 6
dec     rax
sub     rax, rdi
ret

Some more alternatives

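return ~((~0ULL << (n & 0x3F)) & -(unsigned long long)(n != 64)); // the AND clears the shifted value when n == 64
return ((1ULL << (n & 0x3F)) - 1) | -((uint64_t)n >> 6);          // n >> 6 is 1 only for n == 64; negated, it is all ones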
return (uint64_t)(((__uint128_t)1 << n) - 1); // if a 128-bit type is available
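
The 128-bit variant depends on a compiler extension, so a portable build needs a fallback. A sketch, assuming GCC/Clang's __SIZEOF_INT128__ macro as the feature test and the hypothetical name mask_wide; the fallback splits the shift in two so that neither count reaches 64:

```c
#include <assert.h>
#include <stdint.h>

static inline uint64_t mask_wide(unsigned n)
{
    assert(n <= 64);
#if defined(__SIZEOF_INT128__)
    /* 1 << 64 fits in 128 bits; subtracting 1 and truncating leaves 64 ones. */
    return (uint64_t)(((unsigned __int128)1 << n) - 1);
#else
    /* Two shifts whose counts sum to n, each at most 32. */
    return (((uint64_t)1 << (n / 2)) << (n - n / 2)) - 1;
#endif
}
```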

A similar question for 32 bits: Set last `n` bits in unsigned int

score 4 · Answer 3 · answered Sep 30 '18 at 07:03

Here's one that is portable and conditional-free:

unsigned long long mask(unsigned n)
{
    assert (n <= sizeof(unsigned long long) * CHAR_BIT);
    return (1ULL << (n/2) << (n-(n/2))) - 1;
}
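
This works because the two shift counts, n/2 and n - n/2, always sum to n while each stays at or below half the width, so neither shift is undefined. For the full-width n the bit is simply shifted out, and 0 - 1 wraps to all ones. A sketch of the same idea for a 32-bit result, with the name mask32_split assumed for illustration:

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t mask32_split(unsigned n)
{
    assert(n <= 32);
    unsigned half = n / 2;                /* at most 16 */
    uint32_t bit = (uint32_t)1 << half;   /* first shift: count is at most 16 */
    bit <<= (n - half);                   /* second shift: count is at most 16;
                                             for n == 32 the bit is shifted out */
    return bit - 1;                       /* 0 - 1 wraps to all ones */
}
```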

answered Sep 30 '18 at 07:03

n. m. could be an AI


Not terrible if BMI2 is available for `shlx` single-uop variable-count left shift: https://godbolt.org/z/QXW0ID – Peter Cordes Oct 01 '18 at 00:35

score 4 · Answer 4 · answered Jun 09 '19 at 02:39

This is not an answer to the exact question. It only works if `0` isn't a required output, but is more efficient.

2ⁿ⁺¹ - 1 computed without overflow, i.e. an integer with the low n+1 bits set, for n = 0 .. all_bits - 1

Possibly using this inside a ternary for cmov could be a more efficient solution to the full problem in the question. Perhaps based on a left-rotate of a number with the MSB set, instead of a left-shift of 1, to take care of the difference in counting for this vs. the question for the pow2 calculation.

// defined for n = 0 .. sizeof(unsigned long long)*CHAR_BIT - 1
unsigned long long setbits_upto(unsigned n) {
    unsigned long long pow2 = 1ULL << n;
    return pow2*2 - 1;                  // one more shift, and subtract 1.
}

Compiler output suggests an alternate version, good on some ISAs if you're not using gcc/clang (which already do this): bake in an extra shift count so it is possible for the initial shift to shift out all the bits, leaving 0 - 1 = all bits set.

unsigned long long setbits_upto2(unsigned n) {
    unsigned long long pow2 = 2ULL << n;      // bake in the extra shift count
    return pow2 - 1;
}

The table of inputs / outputs for a 32-bit version of this function is:

 n   ->  1<<n        ->    *2 - 1
0    ->    1         ->   1        = 2 - 1
1    ->    2         ->   3        = 4 - 1
2    ->    4         ->   7        = 8 - 1
3    ->    8         ->  15        = 16 - 1
...
30   ->  0x40000000  ->  0x7FFFFFFF  = 0x80000000 - 1
31   ->  0x80000000  ->  0xFFFFFFFF  = 0 - 1

You could slap a cmov after it, or other way of handling an input that has to produce zero.
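
One way to handle the input that has to produce zero is a ternary around the 2ULL << n form, which a compiler may lower to a cmov (not verified here). A sketch, with mask_via_upto as an assumed name:

```c
#include <assert.h>
#include <stdint.h>

/* Low n+1 bits set, defined for n in [0, 63]; for n == 63 the 2 is
   shifted out entirely and 0 - 1 wraps to all ones. */
static inline uint64_t setbits_upto2(unsigned n)
{
    assert(n <= 63);
    return ((uint64_t)2 << n) - 1;
}

/* Full problem from the question: n in [0, 64], with mask(0) == 0. */
static inline uint64_t mask_via_upto(unsigned n)
{
    assert(n <= 64);
    return n ? setbits_upto2(n - 1) : 0;
}
```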

On x86, we can efficiently compute this with 3 single-uop instructions: (Or 2 uops for BTS on Ryzen).

xor  eax, eax
bts  rax, rdi               ; rax = 1<<(n&63)
lea  rax, [rax + rax - 1]   ; one more left shift, and subtract

(3-component LEA has 3 cycle latency on Intel, but I believe this is optimal for uop count and thus throughput in many cases.)

In C this compiles nicely for all 64-bit ISAs except x86 Intel SnB-family

C compilers unfortunately are dumb and miss using bts even when tuning for Intel CPUs without BMI2 (where shl reg,cl is 3 uops).

e.g. gcc and clang both do this (with dec or add -1), on Godbolt

# gcc9.1 -O3 -mtune=haswell
setbits_upto(unsigned int):
    mov     ecx, edi
    mov     eax, 2       ; bake in the extra shift by 1.
    sal     rax, cl
    dec     rax
    ret

MSVC starts with n in ECX because of the Windows x64 calling convention, but modulo that, it and ICC do the same thing:

# ICC19
setbits_upto(unsigned int):
    mov       eax, 1                                        #3.21
    mov       ecx, edi                                      #2.39
    shl       rax, cl                                       #2.39
    lea       rax, QWORD PTR [-1+rax+rax]                   #3.21
    ret                                                     #3.21

With BMI2 (-march=haswell), we get optimal-for-AMD code from gcc/clang:

    mov     eax, 2
    shlx    rax, rax, rdi
    add     rax, -1

ICC still uses a 3-component LEA, so if you target MSVC or ICC, use the 2ULL << n version in the source whether or not you enable BMI2, because you're not getting BTS either way. This avoids the worst of both worlds: slow-LEA and a variable-count shift instead of BTS.

On non-x86 ISAs (where presumably variable-count shifts are efficient because they don't have the x86 tax of leaving flags unmodified if the count happens to be zero, and can use any register as the count), this compiles just fine.

e.g. AArch64. And of course this can hoist the constant 2 for reuse with different n, like x86 can with BMI2 shlx.

setbits_upto(unsigned int):
    mov     x1, 2
    lsl     x0, x1, x0
    sub     x0, x0, #1
    ret

Basically the same on PowerPC, RISC-V, etc.

technosaurus · Answer 5 · 2019-07-19T20:49:49.187

#include <stdint.h>

uint64_t mask_n_bits(const unsigned n){
  uint64_t ret = n < 64;
  ret <<= n&63; //the &63 is typically optimized away
  ret -= 1;
  return ret;
}

Results:

mask_n_bits:
    xor     eax, eax
    cmp     edi, 63
    setbe   al
    shlx    rax, rax, rdi
    dec     rax
    ret

Returns the expected results, and if passed a constant value it is optimized to a constant mask by clang and gcc as well as icc at -O2 (but not -Os).

Explanation:

The &63 gets optimized away, but ensures the shift count is <= 63.

For values less than 64 it just sets the first n bits using (1<<n)-1. 1<<n sets the nth bit (equivalent to pow(2,n)) and subtracting 1 from a power of 2 sets all bits less than that.

By using the conditional to set the initial 1 to be shifted, no branch is created, yet it gives you a 0 for all values >=64 because left shifting a 0 will always yield 0. Therefore when we subtract 1, we get all bits set for values of 64 and larger (because of 2s complement representation for -1).

Caveats:

1s complement systems must die - requires special casing if you have one
some compilers may not optimize the &63 away
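
The same construction carries over to narrower types by changing the two constants. A sketch of a 32-bit analogue, with the name mask_n_bits32 assumed for illustration; whether the &31 is optimized away depends on the compiler and target:

```c
#include <assert.h>
#include <stdint.h>

static inline uint32_t mask_n_bits32(unsigned n)
{
    uint32_t ret = n < 32;   /* 1 for n < 32, 0 for n >= 32 */
    ret <<= n & 31;          /* the & 31 keeps the count below the width */
    ret -= 1;                /* 2^n - 1, or 0 - 1 == all ones for n >= 32 */
    return ret;
}
```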

Unfortunately it is UB to shift a 64 bit value by 64 or more. — BeeOnRope, Jul 19 '19 at 14:50
@BeeOnRope : I added the &63 that gets optimized away anyhow. — technosaurus, Jul 19 '19 at 20:33
IIRC, there are some ISAs that saturate their shift counts instead of masking as part of the instruction (e.g. ARM32 but not AArch64). A smart compiler could still legally optimize away the `&63` in this case because the value being shifted is already `0` for higher shift counts. But in practice GCC for ARM32 doesn't, for a 32-bit version of this. https://godbolt.org/z/PiIOcO. It compiles very efficiently for AArch64, though; AArch64's `cset` is better than x86's lame 8-bit `setcc`. — Peter Cordes, Jul 19 '19 at 22:29

Řrřola · Answer 6 · 2021-12-08T12:48:02.617

When the input N is between 1 and 64, we can use -uint64_t(1) >> (64-N & 63).
The constant -1 has 64 set bits and we shift 64-N of them away, so we're left with N set bits.

When N=0, we can make the constant zero before shifting:

uint64_t mask(unsigned N)
{
    return -uint64_t(N != 0) >> (64-N & 63);
}

This compiles to five instructions in x64 clang:

neg sets the carry flag to N != 0.
sbb turns the carry flag into 0 or -1.
shr rax,N already has an implicit N & 63, so 64-N & 63 was optimized to -N.

mov rcx,rdi
neg rcx
sbb rax,rax
shr rax,cl
ret

With the BMI2 extension, it's only four instructions (the shift length can stay in rdi):

neg edi
sbb rax,rax
shrx rax,rax,rdi
ret
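
The expression above uses a C++ function-style cast. A sketch of the same idea in plain C, with the hypothetical name mask_neg_shr and the precedence of 64-N & 63 spelled out with parentheses:

```c
#include <assert.h>
#include <stdint.h>

static inline uint64_t mask_neg_shr(unsigned n)
{
    assert(n <= 64);
    uint64_t ones = -(uint64_t)(n != 0);  /* 0 when n == 0, otherwise all ones */
    return ones >> ((64 - n) & 63);       /* count is 0 for both n == 0 and n == 64 */
}
```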

if BMI2 is available then you just need `mov rax, -1; bzhi rax, rax, rdi` https://gcc.godbolt.org/z/ocdqa9 — phuclv, Oct 17 '20 at 14:24
