The biggest issue here is the range of possible inputs. In C and C++, shifting by a count greater than or equal to the type width is Undefined Behaviour. However, it looks like len can meaningfully range from 0 to the type width, e.g. 33 different lengths for uint32_t. With pos=0, we get masks from 0 to 0xFFFFFFFF. (I'm just going to assume 32-bit in English and asm for clarity, but use generic C++.)
If we can exclude either end of that range as possible inputs, then there are only 32 possible lengths, and we can use a left or right shift as a building block. (Use an assert() to verify the input range in debug builds.)
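As a rough illustration of that assert() suggestion, here is a minimal 32-bit-only sketch. The name make_mask_checked is made up for this example and is not one of the functions discussed in the other answers; it assumes the caller can guarantee len=1..32 and pos=0..31.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical helper (not from any of the answers): only valid for
// len=1..32 and pos=0..31, with the range checked in debug builds.
inline std::uint32_t make_mask_checked(std::size_t pos, std::size_t len)
{
    assert(len >= 1 && len <= 32 && "len must be in 1..32");
    assert(pos <= 31 && "pos must be in 0..31");
    // For the asserted range the shift count 32-len is 0..31, so no UB.
    std::uint32_t len_ones = UINT32_MAX >> (32 - len);
    return len_ones << pos;
}
```

With NDEBUG defined the checks compile away, so release builds pay nothing for them, but out-of-range inputs are then back to being UB.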
I put several versions (from other answers) of the function on the Godbolt compiler explorer, with some macros to compile them with constant len, constant pos, or both inputs variable. Some do better than others. KIIV's looks good for the range it's valid for (len=0..31, pos=0..31).
This version works for len=1..32, and pos=0..31. It generates slightly worse x86-64 asm than KIIV's, so use KIIV's if it works without extra checks.

    #include <cstddef>   // std::size_t
    #include <cstdint>   // uint32_t (used by the non-template versions below)
    #include <limits>    // std::numeric_limits

    // right-shift a register of all-ones, then shift it into position.
    // works for len=1..32 and pos=0..31
    template <class T>
    constexpr T make_mask_PJC(std::size_t pos, std::size_t len)
    {
        // T all_ones = -1LL;
        // unsigned typebits = sizeof(T)*CHAR_BIT;  // std::numeric_limits<T>::digits
        // T len_ones = all_ones >> (typebits - len);
        // return len_ones << pos;
        static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
        return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
    }

    // Same idea, but mask the shift count the same way x86 shift instructions do, so the compiler can do it for free.
    // Doesn't always compile to ideal code with SHRX (BMI2); maybe gcc only knows about letting the shift
    // instruction do the masking for the older SHR / SHL instructions.
    uint32_t make_mask_PJC_noUB(std::size_t pos, std::size_t len)
    {
        using T = uint32_t;
        static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
        T all_ones = -1LL;
        unsigned typebits = std::numeric_limits<T>::digits;
        T len_ones = all_ones >> ((typebits - len) & (typebits - 1));  // the AND optimizes away
        return len_ones << (pos & (typebits - 1));
        // return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
    }
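One consequence of masking the count is worth spelling out: len=0 no longer invokes UB, but it still doesn't produce an empty mask, because (32 - 0) & 31 is 0 and nothing gets shifted out. The sketch below is a stand-alone 32-bit copy of the same logic under a made-up name (masked_count_mask), only there to make that behaviour easy to check.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Hypothetical stand-alone copy of the masked-count logic, for illustration only.
inline std::uint32_t masked_count_mask(std::size_t pos, std::size_t len)
{
    const unsigned typebits = 32;
    std::uint32_t all_ones = UINT32_MAX;
    // (typebits - len) & 31 is 0 for both len=32 and len=0.
    std::uint32_t len_ones = all_ones >> ((typebits - len) & (typebits - 1));
    return len_ones << (pos & (typebits - 1));
}
```

So masked_count_mask(0, 32) and masked_count_mask(0, 0) both come out as all-ones: well-defined, but only the first is the mask you wanted. The valid range is still len=1..32.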
If len can be anything in [0..32], I don't have any great ideas for efficient branchless code. Perhaps branching is the way to go.

    uint32_t make_mask_fullrange(std::size_t pos, std::size_t len)
    {
        using T = uint32_t;
        static_assert(std::numeric_limits<T>::radix == 2, "T isn't an integer type");
        T all_ones = -1LL;
        unsigned typebits = std::numeric_limits<T>::digits;
        //T len_ones = all_ones >> ((typebits - len) & (typebits - 1));
        T len_ones = len == 0 ? 0 : all_ones >> ((typebits - len) & (typebits - 1));
        return len_ones << (pos & (typebits - 1));
        // return static_cast<T>(-1LL) >> (std::numeric_limits<T>::digits - len) << pos;  // pre-C++14 constexpr needs it all in one statement
    }
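Since there are only 33 x 32 input combinations for 32-bit, it's cheap to check the branching version exhaustively against a naive bit-by-bit reference. This is just a sketch: mask_reference, mask_fullrange and fullrange_matches_reference are names invented for this example, and mask_fullrange is a stand-alone copy of the logic above so the block compiles on its own. It assumes that bits shifted past the top of the type are simply dropped.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>

// Naive reference (hypothetical): set bits pos..pos+len-1 one at a time,
// dropping any that would fall off the top of a 32-bit value.
inline std::uint32_t mask_reference(std::size_t pos, std::size_t len)
{
    std::uint32_t m = 0;
    for (std::size_t i = pos; i < pos + len && i < 32; ++i)
        m |= std::uint32_t{1} << i;
    return m;
}

// Stand-alone 32-bit copy of the branching approach, for testing.
inline std::uint32_t mask_fullrange(std::size_t pos, std::size_t len)
{
    const unsigned typebits = 32;
    std::uint32_t len_ones =
        (len == 0) ? 0 : UINT32_MAX >> ((typebits - len) & (typebits - 1));
    return len_ones << (pos & (typebits - 1));
}

// True if both agree for every len in 0..32 and every pos in 0..31.
inline bool fullrange_matches_reference()
{
    for (std::size_t len = 0; len <= 32; ++len)
        for (std::size_t pos = 0; pos < 32; ++pos)
            if (mask_fullrange(pos, len) != mask_reference(pos, len))
                return false;
    return true;
}
```

The same harness can be pointed at any of the other variants by narrowing the loop bounds to the range that variant claims to support.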