3

My question is about what the compiler is doing in this case that optimizes the code way more than what I would think is possible.

Given this enum:

enum MyEnum {
    Entry1,
    Entry2,
    ...   // Entry3..27 are the same, omitted for size.
    Entry28,
    Entry29
};

And this function:

bool MyFunction(MyEnum e)
{
    if (
    e == MyEnum::Entry1 || 
    e == MyEnum::Entry3 || 
    e == MyEnum::Entry8 || 
    e == MyEnum::Entry14 || 
    e == MyEnum::Entry15 ||
    e == MyEnum::Entry18 ||
    e == MyEnum::Entry21 || 
    e == MyEnum::Entry22 ||
    e == MyEnum::Entry25)
    {
        return true;
    }
    return false;

}

For the function, MSVC generates this assembly when compiled with the -Ox optimization flag (Godbolt):

bool MyFunction(MyEnum) PROC                  ; MyFunction
        cmp     ecx, 24
        ja      SHORT $LN5@MyFunction
        mov     eax, 20078725                   ; 01326085H
        bt      eax, ecx
        jae     SHORT $LN5@MyFunction
        mov     al, 1
        ret     0
$LN5@MyFunction:
        xor     al, al
        ret     0

Clang generates similar (slightly better, one less jump) assembly when compiled with the -O3 flag:

MyFunction(MyEnum):                  # @MyFunction(MyEnum)
        cmp     edi, 24
        ja      .LBB0_2
        mov     eax, 20078725
        mov     ecx, edi
        shr     eax, cl
        and     al, 1
        ret
.LBB0_2:
        xor     eax, eax
        ret

What is happening here? I see that even if I add more enum comparisons to the function, the generated assembly does not actually get longer; it's only this magic number (20078725) that changes, depending on which enum values the function compares. I do not understand what is happening here.

The reason why I am looking at this is that I was wondering if it is good to write the function as above, or alternatively like this, with bitwise comparisons:

bool MyFunction2(MyEnum e)
{
    if (
    e == MyEnum::Entry1 | 
    e == MyEnum::Entry3 |
    e == MyEnum::Entry8 |
    e == MyEnum::Entry14 |
    e == MyEnum::Entry15 |
    e == MyEnum::Entry18 |
    e == MyEnum::Entry21 |
    e == MyEnum::Entry22 |
    e == MyEnum::Entry25)
    {
        return true;
    }
    return false;

}

This results in this generated assembly with MSVC:

bool MyFunction2(MyEnum) PROC           ; MyFunction2
        xor     edx, edx
        mov     r9d, 1
        cmp     ecx, 24
        mov     eax, edx
        mov     r8d, edx
        sete    r8b
        cmp     ecx, 21
        sete    al
        or      r8d, eax
        mov     eax, edx
        cmp     ecx, 20
        cmove   r8d, r9d
        cmp     ecx, 17
        sete    al
        or      r8d, eax
        mov     eax, edx
        cmp     ecx, 14
        cmove   r8d, r9d
        cmp     ecx, 13
        sete    al
        or      r8d, eax
        cmp     ecx, 7
        cmove   r8d, r9d
        cmp     ecx, 2
        sete    dl
        or      r8d, edx
        test    ecx, ecx
        cmove   r8d, r9d
        test    r8d, r8d
        setne   al
        ret     0

Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.

Peter Cordes
JohnAl
  • probably a missed optimization here, but [MSVC does know how to emit a bit lookup table](https://stackoverflow.com/q/26124620/995714) like this – phuclv Dec 06 '22 at 16:46

2 Answers

7

Quite smart! The first comparison with 24 is a rough range check - if the value is above 24 (which, since the comparison is unsigned, also catches negative values) it bails out; this is important because the instructions that follow, which operate on the magic number, can only test bit positions in the range [0, 31].

For the rest, the magic number is just a bitmask, with the bits corresponding to the "good" values set.

>>> bin(20078725)
'0b1001100100110000010000101'

It's easy to spot that the 1st and 3rd bits (counting from 1, right to left) are set, as well as the 8th, 14th, 15th, ...

MSVC checks the bit "directly" using the BT (bit test) instruction and branches on the result; clang instead shifts the mask right by the appropriate amount (to bring the relevant bit into the lowest position) and isolates it by ANDing with 1, avoiding a branch.

The C code corresponding to the clang version would be something like:

bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return (20078725 >> e) & 1;
}

As for the MSVC version, it's more like:

inline bool bit_test(unsigned val, int bit) {
    return val & (1<<bit);
}

bool MyFunction(MyEnum e) {
    if(unsigned(e) > 24) return false;
    return bit_test(20078725, e);
}

(I kept the bit_test function separate to emphasize that it's actually a single instruction in assembly; the val & (1<<bit) expression has no direct counterpart in the generated code, which does it all with one bt.)


As for the code generated for your second snippet, it's quite bad - it uses a lot of CMOVs and ORs the results together, which is both longer code and will probably serialize execution. I suspect the corresponding clang code will be better. OTOH, you wrote this code using bitwise OR (|) instead of the more semantically correct logical OR (||), and the compiler is strictly following your orders (typical of MSVC).

Another possibility to try instead could be a switch - but I don't think there's much to gain compared to the code already generated for the first snippet, which looks pretty good to me.


Ok, doing a quick test with all the versions against all compilers, we can see that:

  • the C translation of the clang output above results in pretty much the same code (equal to the clang output) in all compilers; similarly for the MSVC translation;
  • the bitwise-or version produces the same (good) code as the logical-or version in both clang and gcc;
  • in general, gcc does essentially the same thing as clang, except for the switch case;
  • switch results are varied:
    • clang does best, generating the exact same code;
    • both gcc and MSVC generate jump-table based code, which in this case is less good; however:
      • gcc prefers to emit a table of QWORDs, trading size for simplicity of the setup code;
      • MSVC instead emits a table of BYTEs, paying for it in setup-code size; I couldn't get gcc to emit similar code even when changing -O3 to -Os (optimize for size).
Matteo Italia
  • Thanks! I didn't quite understand yet how the stuff with the bitmask works here - if I use a bitmask for comparisons, am I not limited to 8 values with an 8 bit enum? I do check more than 8 values here, and the generated assembly is the same if I specifically set the enum to be of type `char`. What is the max number of enum values to compare that this optimization works with? Could you maybe tell me how the "manual" code for doing this exact bitmask check would look like? Also, I'm not quite sure what you mean with "if-based code" and "switch-based code"? – JohnAl Sep 26 '19 at 06:18
  • 1
    @JohnAl For the code to work "as is", the maximum enum value is 31 - remember that "regular" x86 register size is 32 bit. I was going to post the corresponding C code, but from the bitmask it seems that the values of the enum aren't consecutive or something, could you post the actual definition of the enum? As for the `switch`, sorry it was a brainfart - for some reason while writing the answer I was thinking that the first snippet was done with a `switch` instead of an `if`; I fixed that part of the answer afterwards. – Matteo Italia Sep 26 '19 at 06:22
  • The enum is consecutive. I have added the full enum definition to the original question. – JohnAl Sep 26 '19 at 06:25
  • 2
    @JohnAl if you are writing questions about C, C++ and optimization I'm likely to spot them and try to answer. Looking at your first question, I do remember it fondly, both because it was an interesting problem, and the reply was a lot of fun to write and very well received. Coincidentally, it was written on the exact same train as this one, so it happens that you post questions exactly during my commute, when I'm quite likely to reply – Matteo Italia Sep 26 '19 at 06:25
  • Argh ok I was stupid and forgot a zero. Yes the magic value now makes sense. I'll update the answer. – Matteo Italia Sep 26 '19 at 06:30
  • Interesting! I have just tested with extending the enum to have 50 entries instead of 29, and extending the function to do 20 comparisons with as randomly as possible distributed entries from the enum, and the generated assembly is still exactly the same, just that the magic number then is 168003772768389. So I don't quite understand where that "hard cap to [0, 31] for operand range" as you mentioned comes into place? – JohnAl Sep 26 '19 at 06:33
  • 3
    @JohnAl: ah the old immediate bitmap trick. GCC does this too, at least for a `switch`. e.g. [x86 asm casetable implementation](//stackoverflow.com/q/47779014). (I first noticed it when commenting here: [Advantage of switch over if-else statement](//stackoverflow.com/posts/comments/52559281)). GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3. Another example of using it, this time for code-golf to detect certain letters: [User Appreciation Challenge #1: Dennis ♦](//codegolf.stackexchange.com/a/123458) – Peter Cordes Sep 26 '19 at 06:33
  • 2
    @JohnAl: And yes, turning a bitmap into a boolean is much more efficient than a sequence of tests. `bt` is cheap, and so is `shl` (although with a variable count it is 3 uops on Intel: https://agner.org/optimize/). When you want to branch on it, instead of a boolean integer result, `bt` is probably good. Or use the CF result of the `shr` with the bitmap pre-shifted, so CF = the bit shifted out. Anyway, MSVC's code for the `|` instead of `||` version is a big missed-optimization. – Peter Cordes Sep 26 '19 at 06:35
  • 1
    @JohnAl: check the assembly; over a range comprising more than 32 values it should switch to 64-bit registers (indeed, 168003772768389 doesn't fit in 32 bits); look for rax, rdi, rcx, ... instead of eax, edi, ecx. Over 64 values, it will either split it into two bitmask checks, or will generate completely different code. – Matteo Italia Sep 26 '19 at 06:38
  • 3
    @JohnAl: a 32-bit integer bitmap only has 32 entries. For x86-64 obviously you can still easily use 64-bit integers. It doesn't have to be `[0,31]`, though: any contiguous range of up to 32, or up to 64, values will work. It just means an extra `sub` before the range-check branch and then bitmap test. You could even use an arbitrary-size bitmap in memory (as data, not an immediate), either with `bt [mem],reg` (inefficient) or manually indexing a dword and checking that. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless... – Peter Cordes Sep 26 '19 at 06:39
  • 1
    To express `bt` in C: `dst & (1<<bit)` – Peter Cordes Sep 26 '19 at 06:47
  • Thank you both! That makes a lot of sense. @MatteoItalia Your C code does result in quite different assembly when compiled with MSVC, but I guess that's not too relevant. It's good to see that MSVC isn't too bad when it comes to optimizing this. – JohnAl Sep 26 '19 at 06:48
  • @PeterCordes: about `bt`: yes of course, but I preferred the other one as it has a straight, 1:1 translation in C which doesn't generate confusion; translating `bt` with the `1<<` and `&` is a bit more hairy, as someone who doesn't know x86 assembly may start looking for a corresponding `shl`/`and` in the code, when it's all done in a single instruction that actually just sets the flags. – Matteo Italia Sep 26 '19 at 07:02
  • 1
    Yes, agreed on the choice for pedagogical reasons, and because it's more efficient and easier to express in C because of less branching. I guess I was just nit-picking at the wording of the text explaining why you chose the SHR version. – Peter Cordes Sep 26 '19 at 07:09
  • @PeterCordes: that's a perfectly legit consideration, I'll try to clarify the wording - after all I'm not at my top reply game when I'm half asleep on the train :-D – Matteo Italia Sep 26 '19 at 08:01
  • @MatteoItalia: turned some of my comments into an answer. On further consideration, I think `bt`/`setc` is actually best: only 2 uops vs. 4 for `shr`/`and`, or 5 if you include the `mov` to put the arg in `cl`. – Peter Cordes Sep 26 '19 at 10:53
5

Ah, the old immediate bitmap trick.

GCC does this too, at least for a switch. x86 asm casetable implementation. Unfortunately GCC9 has a regression for some cases: https://gcc.gnu.org/bugzilla/show_bug.cgi?id=91026#c3 ; GCC8 and earlier do a better job.

Another example of using it, this time for code-golf (fewest bytes of code, in this case x86 machine code) to detect certain letters: User Appreciation Challenge #1: Dennis ♦


The basic idea is to use the input as an index into a bitmap of true/false results.

First you have to range-check, because the bitmap is fixed-width and x86 shifts wrap the shift count. We don't want high inputs to alias into the range where some values should return true. That's what the cmp edi, 24 / ja is doing.

(If the range between the lowest and highest true values was from 120 to 140, for example, it might start with a sub edi,120 to range-shift everything before the cmp.)

Then you use bitmap & (1<<e) (the bt instruction), or (bitmap >> e) & 1 (shr / and) to check the bit in the bitmap that tells you whether that e value should return true or false.

There are many ways to implement that check, logically equivalent but with performance differences.


If the range was wider than 32, it would have to use 64-bit operand-size. If it was wider than 64, the compiler might not attempt this optimization at all. Or might still do it for some of the conditions that are in a narrow range.

Using an even larger bitmap (in .rodata memory) would be possible but probably not something most compilers will invent for you. Either with bt [mem],reg (inefficient) or manually indexing a dword and checking that the same way this code checks the immediate bitmap. If you had a lot of high-entropy ranges it might be worth checking 2x 64-bit immediate bitmap, branchy or branchless...

Clang/LLVM has other tricks up its sleeve for efficiently comparing against multiple values (when it doesn't matter which one is hit), e.g. broadcast a value into a SIMD register and use a packed compare. That isn't dependent on the values being in a dense range. (Clang generates worse code for 7 comparisons than for 8 comparisons)

> that optimizes the code way more than what I would think is possible.

These kinds of optimizations come from smart human compiler developers who notice common patterns in source code and think of clever ways to implement them. They then teach compilers to recognize those patterns and transform their internal representation of the program logic to use the trick.

It turns out that switch and switch-like if() statements are common, so aggressive optimizations for them are worth implementing.

Compilers are far from perfect, but sometimes they do come close to living up to what people often claim; that compilers will optimize your code for you so you can write it in a human-readable way and still have it run near-optimally. This is sometimes true over the small scale.


> Since I do not understand what happens in the first case, I can not really judge which one is more efficient in my case.

The immediate bitmap is vastly more efficient. There's no data memory access in either one, so no cache-miss loads. The only "expensive" instruction is a variable-count shift (3 uops on mainstream Intel because of x86's annoying FLAGS-setting semantics; BMI2 shrx is only 1 uop and avoids having to mov the count into ecx). https://agner.org/optimize/. And see other performance-analysis links in https://stackoverflow.com/tags/x86/info.

Each instruction in the cmp/cmov chain is at least 1 uop, and there's a pretty long dependency chain through each cmov because MSVC didn't bother to break it into 2 or more parallel chains. But regardless it's just a lot of uops, far more than the bitmap version, so worse for throughput (ability for out-of-order exec to overlap the work with surrounding code) as well as latency.

bt is also cheap: 1 uop on modern AMD and Intel. (bts, btr, btc are 2 on AMD, still 1 on Intel).

The branch in the immediate-bitmap version could have been a setna / and to make it branchless, but especially for this enum definition the compiler expected that it would be in range. It could have increased branch predictability by only requiring e <= 31, not e <= 24.

Since the enum only goes up to 29, and IIRC it's UB to have out-of-range enum values, it could actually optimize it away entirely.

Even if the e>24 branch doesn't predict very well, it's still probably better overall. Given current compilers, we only get a choice between the nasty chain of cmp/cmov or branch + bitmap. Unless we turn the asm logic back into C to hand-hold compilers into making the asm we want - then we can maybe get a branchless version, with an AND or CMOV forcing the result to zero for out-of-range e.

But if we're lucky, profile-guided optimization might let some compilers make the bitmap range check branchless. (In asm the behaviour of shl reg, cl with cl > 31 or 63 is well-defined: on x86 it simply masks the count. In a C equivalent, you could use bitmap >> (e&31) which can still optimize to a shr; compilers know that x86 shr masks the count so they can optimize that away. But not for other ISAs that saturate the shift count...)


There are lots of ways to implement the bitmap check that are pretty much equivalent. e.g. you could even use the CF output of shr, set according to the last bit shifted out. At least if you make sure CF has a known state ahead of time for the cl=0 case.

When you want an integer bool result, right-shifting seems to make more sense than bt / setcc, but with shr costing 3 uops on Intel it might actually be best to use bt reg,reg / setc al. Especially if you only need a bool, and can use EAX as your bitmap destination so the previous value of EAX is definitely ready before setcc. (Avoiding a false dependency on some unrelated earlier dep chain.)

BTW, MSVC has other silliness: as What is the best way to set a register to zero in x86 assembly: xor, mov or and? explains, xor al,al is totally stupid compared to xor eax,eax when you want to zero AL. If you don't need to leave the upper bytes of RAX unmodified, zero the full register with a zeroing idiom.

And of course branching just to return 0 or return 1 makes little sense, unless you expect it to be very predictable and want to break the data dependency. I'd expect that setc al would make more sense, to read the CF result of bt.

Peter Cordes
  • 2
    Half of this is stuff I posted as comments on Matteo's answer. After realizing it was almost a separate answer, I wrote one up. – Peter Cordes Sep 26 '19 at 10:40