115

I have some code that is more or less like this:

#include <bitset>

enum Flags { A = 1, B = 2, C = 3, D = 5,
             E = 8, F = 13, G = 21, H,
             I, J, K, L, M, N, O };

void apply_known_mask(std::bitset<64> &bits) {
    const Flags important_bits[] = { B, D, E, H, K, M, L, O };
    std::remove_reference<decltype(bits)>::type mask{};
    for (const auto& bit : important_bits) {
        mask.set(bit);
    }

    bits &= mask;
}

Clang >= 3.6 does the smart thing and compiles this to a single and instruction (which then gets inlined everywhere else):

apply_known_mask(std::bitset<64ul>&):  # @apply_known_mask(std::bitset<64ul>&)
        and     qword ptr [rdi], 775946532
        ret

But every version of GCC I've tried compiles this to an enormous mess that includes error handling that should be statically DCE'd. In other code, it will even place the important_bits equivalent as data in line with the code!

.LC0:
        .string "bitset::set"
.LC1:
        .string "%s: __position (which is %zu) >= _Nb (which is %zu)"
apply_known_mask(std::bitset<64ul>&):
        sub     rsp, 40
        xor     esi, esi
        mov     ecx, 2
        movabs  rax, 21474836482
        mov     QWORD PTR [rsp], rax
        mov     r8d, 1
        movabs  rax, 94489280520
        mov     QWORD PTR [rsp+8], rax
        movabs  rax, 115964117017
        mov     QWORD PTR [rsp+16], rax
        movabs  rax, 124554051610
        mov     QWORD PTR [rsp+24], rax
        mov     rax, rsp
        jmp     .L2
.L3:
        mov     edx, DWORD PTR [rax]
        mov     rcx, rdx
        cmp     edx, 63
        ja      .L7
.L2:
        mov     rdx, r8
        add     rax, 4
        sal     rdx, cl
        lea     rcx, [rsp+32]
        or      rsi, rdx
        cmp     rax, rcx
        jne     .L3
        and     QWORD PTR [rdi], rsi
        add     rsp, 40
        ret
.L7:
        mov     ecx, 64
        mov     esi, OFFSET FLAT:.LC0
        mov     edi, OFFSET FLAT:.LC1
        xor     eax, eax
        call    std::__throw_out_of_range_fmt(char const*, ...)

How should I write this code so that both compilers can do the right thing? Failing that, how should I write this so that it remains clear, fast, and maintainable?

grooveplex
  • 2,492
  • 4
  • 28
  • 30
Alex Reinking
  • 16,724
  • 5
  • 52
  • 86
  • 4
    Rather than using a loop, can't you construct a mask with `B | D | E | ... | O`? – HolyBlackCat Feb 20 '19 at 12:25
  • 6
    The enum has bit positions rather than already expanded bits, so I could do `(1ULL << B) | ... | (1ULL << O)` – Alex Reinking Feb 20 '19 at 12:26
  • 3
    The downside being that the actual names are long and irregular and it's not nearly as easy to see which flags are in the mask with all that line noise. – Alex Reinking Feb 20 '19 at 12:32
  • 4
    @AlexReinking You could make it one `(1ULL << Constant)` | per line, and align the constant names on the different lines, that would be easier on the eyes. – einpoklum Feb 21 '19 at 15:37
  • I think problem here related to lack of use unsigned type,GCC always had troubles with statically discarding correction for overflow and type conversion in signed/unsigned hybrid.Result of bit shift here is an `int` result of bit operation MAY BE `int` OR may be `long long` depending on value and formally`enum` isn't an equivalent to a an `int` constant. clang calls for "as if", gcc stays pedantic – Swift - Friday Pie Feb 28 '20 at 08:43
  • @Swift-FridayPie -- can you demonstrate that to be the issue? Changing the "Flags" in `important_bits` out for `int` doesn't seem to change the assembly. – Alex Reinking Feb 28 '20 at 09:09

6 Answers6

118

Best version is :

template< unsigned char... indexes >
constexpr unsigned long long mask(){
  return ((1ull<<indexes)|...|0ull);
}

Then

void apply_known_mask(std::bitset<64> &bits) {
  constexpr auto m = mask<B,D,E,H,K,M,L,O>();
  bits &= m;
}

back in , we can do this strange trick:

template< unsigned char... indexes >
constexpr unsigned long long mask(){
  auto r = 0ull;
  using discard_t = int[]; // data never used
  // value never used:
  discard_t discard = {0,(void(
    r |= (1ull << indexes) // side effect, used
  ),0)...};
  (void)discard; // block unused var warnings
  return r;
}

or, if we are stuck with , we can solve it recursively:

constexpr unsigned long long mask(){
  return 0;
}
template<class...Tail>
constexpr unsigned long long mask(unsigned char b0, Tail...tail){
  return (1ull<<b0) | mask(tail...);
}
template< unsigned char... indexes >
constexpr unsigned long long mask(){
  return mask(indexes...);
}

Godbolt with all 3 -- you can switch CPP_VERSION define, and get identical assembly.

In practice I'd use the most modern I could. 14 beats 11 because we don't have recursion and hence O(n^2) symbol length (which can explode compile time and compiler memory usage); 17 beats 14 because the compiler doesn't have to dead-code-eliminate that array, and that array trick is just ugly.

Of these 14 is the most confusing. Here we create an anonymous array of all 0s, meanwhile as a side effect construct our result, then discard the array. The discarded array has a number of 0s in it equal to the size of our pack, plus 1 (which we add so we can handle empty packs).


A detailed explanation of what the version is doing. This is a trick/hack, and the fact you have to do this to expand parameters packs with efficiency in C++14 is one of the reasons why fold expressions were added in .

It is best understood from the inside out:

    r |= (1ull << indexes) // side effect, used

this just updates r with 1<<indexes for a fixed index. indexes is a parameter pack, so we'll have to expand it.

The rest of the work is to provide a parameter pack to expand indexes inside of.

One step out:

(void(
    r |= (1ull << indexes) // side effect, used
  ),0)

here we cast our expression to void, indicating we don't care about its return value (we just want the side effect of setting r -- in C++, expressions like a |= b also return the value they set a to).

Then we use the comma operator , and 0 to discard the void "value", and return the value 0. So this is an expression whose value is 0 and as a side effect of calculating 0 it sets a bit in r.

  int discard[] = {0,(void(
    r |= (1ull << indexes) // side effect, used
  ),0)...};

At this point, we expand the parameter pack indexes. So we get:

 {
    0,
    (expression that sets a bit and returns 0),
    (expression that sets a bit and returns 0),
    [...]
    (expression that sets a bit and returns 0),
  }

in the {}. This use of , is not the comma operator, but rather the array element separator. This is sizeof...(indexes)+1 0s, which also set bits in r as a side effect. We then assign the {} array construction instructions to an array discard.

Next we cast discard to void -- most compilers will warn you if you create a variable and never read it. All compilers will not complain if you cast it to void, it is sort of a way to say "Yes, I know, I'm not using this", so it suppresses the warning.

Yakk - Adam Nevraumont
  • 262,606
  • 27
  • 330
  • 524
  • 38
    Sorry, but that C++14 code is something. I don't know what. – James Feb 20 '19 at 18:34
  • 14
    @James It is a wonderful motivating example of why fold expressions in C++17 are very welcome. It, and similar tricks, turn out to be an efficient way to expand a pack "inplace" without any recursion and that compliers find easy to optimize. – Yakk - Adam Nevraumont Feb 20 '19 at 19:10
  • The array initializer trick works in C++11 too, not sure why you think you need 14... – rubenvb Feb 20 '19 at 22:17
  • 4
    @ruben multi line constexpr is illegal in 11 – Yakk - Adam Nevraumont Feb 20 '19 at 22:40
  • 6
    I can't see myself checking in that C++14 code. I'll stick to the C++11 one since I need that, anyway, but even if I could use it, the C++14 code requires so much explanation I wouldn't. These masks can always be written to have at most 32 elements, so I'm not worried about O(n^2) behavior. After all, if n is bounded by a constant, then it's really O(1). ;) – Alex Reinking Feb 21 '19 at 07:30
  • 9
    To those trying to understand `((1ull< – Henrik Hansen Feb 21 '19 at 10:53
  • 1
    What about using a variable-template instead? Wouldn't that be even better in C++17? `template static constexpr unsigned long long mask = (0ULL | ... | (1ULL< – Deduplicator Feb 21 '19 at 21:45
  • You can look at asm for all 3 versions at once, if you give them different names. https://godbolt.org/z/HpLAff. (gcc/clang/ICC/MSVC output for x86-64. Including a version that takes the bitmap by value as a 64-bit integer. Although with MSVC, it's returned in memory after being passed in a register?? Craptastic calling convention...) – Peter Cordes Feb 21 '19 at 22:25
  • @HenrikHansen and Yakk: Does this *need* to be a binary fold? A unary fold like `((1ull<` case of an empty param list. I guess if we want to support that instead of expecting programmers to notice that the result is `0` in that case. (ICC19 even has a missed optimization there, storing a zero to the stack then reloading it, then storing that to the reference arg. https://godbolt.org/z/vQYVPC. But GCC and Clang are fine. I didn't figure out how to get MSVC to do C++17) – Peter Cordes Feb 21 '19 at 22:39
  • 1
    @PeterCordes I see no reason not to support the empty mask case. You could have generic code generating a pile of masks based off metaprogramming generated integral sequences or whatever, and having `0` simply break seems rude. – Yakk - Adam Nevraumont Feb 22 '19 at 03:07
  • Yeah, the only reason you might decide to make that case break on purpose is because ICC fails to optimize it to zero as efficiently as it could, and you want to stop ICC users from shooting themselves in the foot. Or maybe an empty list doesn't make sense in the context of some use-case, so it's useful to warn users. But you'd probably do that with `static_assert` on the mask to give a meaningful message, not an error. (When I started writing my previous comment, I hadn't spotted the empty-list corner case yet.) – Peter Cordes Feb 22 '19 at 03:15
  • In C++14 version array could be replaced with creation of temporary `std::initializer_list`, the downside is possible requirement of `#include `. Also this construct could be wrapped to macro for convenience. – Predelnik Feb 22 '19 at 10:13
  • Can somebody please explain the initializer for `discard` in the C++14 solution to me? I have no idea how to parse that `{0, (void(r |= (1ull << indexes)), 0)...}` expression! – oliver Feb 22 '19 at 12:11
  • 3
    @oliver start with `{0,0}` -- an array of 2 elements, both 0. Now do this `{0, (++i,0)}` -- also an array of 2 elements, both 0, but this one as a side effect increments `i`. The `(++i,0)` is the *comma* operator, which evaluates the lhs, then discards it, then evaluates the rhs, then returns it. Next, `{0,(++Is,0)...}` is a pack expansion of `(++Is,0)` (assuming `Is` is a pack). Finally, `void(++Is)` is a way to cast the result of `++Is` into an expression of type `void`; in generic code, this avoids someone overriding `operator,` and doing the wrong thing. – Yakk - Adam Nevraumont Feb 22 '19 at 12:19
  • @Yakk-AdamNevraumont: thanks a lot for your explanation, much appreciated! – oliver Feb 22 '19 at 12:37
48

The optimization you're looking for seems to be loop peeling, which is enabled at -O3, or manually with -fpeel-loops. I'm not sure why this falls under the purview of loop peeling rather than loop unrolling, but possibly it's unwilling to unroll a loop with nonlocal control flow inside it (as there is, potentially, from the range check).

By default, though, GCC stops short of being able to peel all the iterations, which apparently is necessary. Experimentally, passing -O2 -fpeel-loops --param max-peeled-insns=200 (the default value is 100) gets the job done with your original code: https://godbolt.org/z/NNWrga

Sneftel
  • 40,271
  • 12
  • 71
  • 104
  • You are _amazing_ thank you! I had no idea this was configurable in GCC! Though for some reason `-O3 -fpeel-loops --param max-peeled-insns=200` fails... It's due to `-ftree-slp-vectorize` apparently. – Alex Reinking Feb 20 '19 at 19:41
  • This solution seems to be limited to the x86-64 target. The output for ARM and ARM64 is still not pretty, which then again might be completely irrelevant for OP. – realtime Feb 21 '19 at 09:24
  • @realtime - it is somewhat relevant, actually. Thanks for pointing out that it doesn't work in this case. Very disappointing that GCC doesn't catch it before being lowered to a platform-specific IR. [LLVM optimizes it](https://godbolt.org/z/5Zaq7-) before any further lowering. – Alex Reinking Feb 21 '19 at 11:37
10

if using only C++11 is a must (&a)[N] is a way to capture arrays. This allows you to write one single recursive function without using helper functions whatsoever:

template <std::size_t N>
constexpr std::uint64_t generate_mask(Flags const (&a)[N], std::size_t i = 0u){
    return i < N ? (1ull << a[i] | generate_mask(a, i + 1u)) : 0ull;
}

assigning it to a constexpr auto:

void apply_known_mask(std::bitset<64>& bits) {
    constexpr const Flags important_bits[] = { B, D, E, H, K, M, L, O };
    constexpr auto m = generate_mask(important_bits); //< here
    bits &= m;
}

Test

int main() {
    std::bitset<64> b;
    b.flip();
    apply_known_mask(b);
    std::cout << b.to_string() << '\n';
}

Output

0000000000000000000000000000000000101110010000000000000100100100
//                                ^ ^^^  ^             ^  ^  ^
//                                O MLK  H             E  D  B

one really has to appreciate C++`s ability to calculate anything turing computable at compile time. It surely still blows my mind (<>).


For the later versions C++14 and C++17 yakk's answer already wonderfully covers that.

Stack Danny
  • 7,754
  • 2
  • 26
  • 55
  • 3
    How does this demonstrate that `apply_known_mask` actually optimizes? – Alex Reinking Feb 20 '19 at 21:50
  • 2
    @AlexReinking: All the scary bits are `constexpr`. And while that's theoretically not sufficient, we know that GCC is quite capable of evaluating `constexpr` as intended. – MSalters Feb 21 '19 at 12:37
8

I would encourage you to write a proper EnumSet type.

Writing a basic EnumSet<E> in C++14 (onwards) based on std::uint64_t is trivial:

template <typename E>
class EnumSet {
public:
    constexpr EnumSet() = default;

    constexpr EnumSet(std::initializer_list<E> values) {
        for (auto e : values) {
            set(e);
        }
    }

    constexpr bool has(E e) const { return mData & mask(e); }

    constexpr EnumSet& set(E e) { mData |= mask(e); return *this; }

    constexpr EnumSet& unset(E e) { mData &= ~mask(e); return *this; }

    constexpr EnumSet& operator&=(const EnumSet& other) {
        mData &= other.mData;
        return *this;
    }

    constexpr EnumSet& operator|=(const EnumSet& other) {
        mData |= other.mData;
        return *this;
    }

private:
    static constexpr std::uint64_t mask(E e) {
        return std::uint64_t(1) << e;
    }

    std::uint64_t mData = 0;
};

This allows you to write simple code:

void apply_known_mask(EnumSet<Flags>& flags) {
    static constexpr EnumSet<Flags> IMPORTANT{ B, D, E, H, K, M, L, O };

    flags &= IMPORTANT;
}

In C++11, it requires some convolutions, but remains possible nonetheless:

template <typename E>
class EnumSet {
public:
    template <E... Values>
    static constexpr EnumSet make() {
        return EnumSet(make_impl(Values...));
    }

    constexpr EnumSet() = default;

    constexpr bool has(E e) const { return mData & mask(e); }

    void set(E e) { mData |= mask(e); }

    void unset(E e) { mData &= ~mask(e); }

    EnumSet& operator&=(const EnumSet& other) {
        mData &= other.mData;
        return *this;
    }

    EnumSet& operator|=(const EnumSet& other) {
        mData |= other.mData;
        return *this;
    }

private:
    static constexpr std::uint64_t mask(E e) {
        return std::uint64_t(1) << e;
    }

    static constexpr std::uint64_t make_impl() { return 0; }

    template <typename... Tail>
    static constexpr std::uint64_t make_impl(E head, Tail... tail) {
        return mask(head) | make_impl(tail...);
    }

    explicit constexpr EnumSet(std::uint64_t data): mData(data) {}

    std::uint64_t mData = 0;
};

And is invoked with:

void apply_known_mask(EnumSet<Flags>& flags) {
    static constexpr EnumSet<Flags> IMPORTANT =
        EnumSet<Flags>::make<B, D, E, H, K, M, L, O>();

    flags &= IMPORTANT;
}

Even GCC trivially generates an and instruction at -O1 godbolt:

apply_known_mask(EnumSet<Flags>&):
        and     QWORD PTR [rdi], 775946532
        ret
Matthieu M.
  • 287,565
  • 48
  • 449
  • 722
  • 2
    In [tag:C++11] much of your `constexpr` code isn't legal. I mean, some has 2 statements! (C++11 constexpr sucked) – Yakk - Adam Nevraumont Feb 20 '19 at 17:12
  • @Yakk-AdamNevraumont: You did realize that I posted **2** versions of the code, the first for C++14 onward, and a second specially tailored for C++11? (to account for its limitations) – Matthieu M. Feb 20 '19 at 17:29
  • 1
    It might be better to use std::underlying_type instead of std::uint64_t. – James Feb 20 '19 at 18:32
  • @James: Actually, no. Do note that `EnumSet` does not use a value of `E` as value directly, but instead uses `1 << e`. It's a different domain altogether, which is actually what makes the class so valuable => no chance of accidentally indexing by `e` instead of `1 << e`. – Matthieu M. Feb 20 '19 at 18:53
  • @MatthieuM. Yes you're right. I'm confusing it with our own implementation which is very similar to yours. The disadvantage of using (1 << e) is that if e is out of bounds for the underlying_type's size then it's probably UB, hopefully a compiler error. – James Feb 20 '19 at 18:55
  • @James: Yes. In the full-fledged implementation I use a smart enum instrumented with `constexpr std::underlying_type_t min(Id)` (and equivalent `max`), allowing me to select at compile-time the most suitable unsigned integral type for storage. I'll leave that as an exercise to the reader :) – Matthieu M. Feb 20 '19 at 19:02
  • Ah, I missed that you striped `constexpr` off `EnumSet& operator&=` and similar. My bad. – Yakk - Adam Nevraumont Feb 20 '19 at 19:28
  • @Yakk-AdamNevraumont: Honestly, it's been so long since I last used C++11, I had to get godbolt to highlight the issues. I had long forgotten how restricted C++11 `constexpr` was :( – Matthieu M. Feb 20 '19 at 19:51
  • @MatthieuM. Still better than macros! – Yakk - Adam Nevraumont Feb 20 '19 at 21:27
7

Since C++11 you could also use classic TMP technique:

template<std::uint64_t Flag, std::uint64_t... Flags>
struct bitmask
{
    static constexpr std::uint64_t mask = 
        bitmask<Flag>::value | bitmask<Flags...>::value;
};

template<std::uint64_t Flag>
struct bitmask<Flag>
{
    static constexpr std::uint64_t value = (uint64_t)1 << Flag;
};

void apply_known_mask(std::bitset<64> &bits) 
{
    constexpr auto mask = bitmask<B, D, E, H, K, M, L, O>::value;
    bits &= mask;
}

Link to Compiler Explorer: https://godbolt.org/z/Gk6KX1

The advantage of this approach over template constexpr function is that it's potentially slightly faster to compile due to rule of Chiel.

Michał Łoś
  • 633
  • 4
  • 15
0

There are some far to 'clever' ideas here. You are probably not helping maintainability by following them.

is

{B, D, E, H, K, M, L, O};

so much easier to write than

(B| D| E| H| K| M| L| O);

?

Then none of the rest of the code is needed.

ANone
  • 202
  • 1
  • 4