
I want to convert a bool into an int. The "standard" option is:

static int F(bool b) 
{
    int i = Convert.ToInt32(b);

    return i;
}

//ILSpy told me about this
public static int ToInt32(bool value)
{
    if (!value)
    {
        return 0;
    }
    return 1;
}

This code generates the following assembly:

<Program>$.<<Main>$>g__F|0_0(Boolean)
    L0000: test cl, cl
    L0002: jne short L0008
    L0004: xor eax, eax
    L0006: jmp short L000d
    L0008: mov eax, 1
    L000d: ret

As you might have noticed, this is a questionable way (I think it is; I'm not an expert) of converting a bool into an int.

What I have tried

The hunt is for the following assembly, which is generated by GCC:

code:

__attribute__((ms_abi)) 
int
f(bool b) {
        int i;
        i = (int)b;

        return i;
}

asm:

f(bool):
        movzx   eax, cl
        ret
  • In the first step I merged the two functions into one:
static int G(bool b) 
{
    int i = b == true ? 1 : 0;

    return i;
}

I think it helped a bit (see the comments in code).

<Program>$.<<Main>$>g__G|0_1(Boolean)
    L0000: test cl, cl
    L0002: jne short L0007
    L0004: xor eax, eax
    L0006: ret            ; This returns directly instead of jumping to a shared RET instruction.
    L0007: mov eax, 1
    L000c: ret
  • In the next step I tried to use unsafe tricks:
static unsafe int H(bool b) 
{
    int i = *(int*)&b;         

    return i;
}

this generates:

<Program>$.<<Main>$>g__H|0_2(Boolean)
    L0000: mov [rsp+8], ecx           ; it looks better but I can't get rid of this line
    L0004: mov eax, [rsp+8]           ; it looks better but I can't get rid of this line
    L0008: movzx eax, al
    L000b: ret
  • In the next step I removed the temp variable (I thought it would help).
static unsafe int Y(bool b) 
{
    return *(int*)&b;
}

this generates the same ASM:

<Program>$.<<Main>$>g__Y|0_3(Boolean)
    L0000: mov [rsp+8], ecx
    L0004: mov eax, [rsp+8]
    L0008: movzx eax, al
    L000b: ret

Question

As you can see, I'm stuck here (I don't know how to remove the first two instructions). Is there a more efficient way of converting a bool variable into an int one?

Note

  • In case you want to play with example: here is the SharpLab link.

  • Benchmark results:

on x64/Release for 5000000000 iterations:

  1. H() took ~1320ms
  2. F() took ~1610ms
  • Here is the code used for benchmarking:
var w = new Stopwatch();

long r = 0;
for (int i = 0; i < 10; ++i)
{
    w.Restart();
    for (long j = 0; j < 5000000000; ++j)
    {
        F(true);
        F(false);
    }
    w.Stop();
    r += w.ElapsedMilliseconds;
    Console.WriteLine(w.ElapsedMilliseconds);
}

Console.WriteLine("AV" + r / 10);
  • What is wrong with first two options? – Pavel Anikhouski Apr 07 '21 at 11:31
  • @PavelAnikhouski they are slow compared to the `C++` version. Maybe someone can tell me if there is a way of getting a more efficient version. –  Apr 07 '21 at 11:32
  • Are they actually, measurably slower? – canton7 Apr 07 '21 at 11:33
  • Why does [`Convert.ToInt32(bool)`](https://learn.microsoft.com/en-us/dotnet/api/system.convert.toint32?view=net-5.0#System_Convert_ToInt32_System_Boolean_) use a "questionable way"? – Tim Schmelter Apr 07 '21 at 11:34
  • If I'm not wrong, `MOVZX` has a latency of `1 cycle`, whereas the 1st version has a latency of `~3-4 cycles`? I'm not used to measuring assembly instructions. Correct me if I'm wrong. –  Apr 07 '21 at 11:36
  • @Hrant Did you measure how much slower it is? Is it really meaningful in real apps? – Pavel Anikhouski Apr 07 '21 at 11:37
  • Note that using `Unsafe.As` produces the same as your `unsafe` pointer code – canton7 Apr 07 '21 at 11:38
  • This is a toy example. Didn't measure them, however I think it'll be interesting to see if there is a faster way of doing this. Lemme actually measure them and include into my question, so we can have more reliable discussion. –  Apr 07 '21 at 11:38
  • Looking at the generated assembly code is a bad way of inspecting the efficiency of the C# compiler or runtime, since the JIT automatically generates more verbose, less efficient code if it detects an attached debugger. First compare the generated IL code for the different solutions. – PMF Apr 07 '21 at 11:39
  • @PMF this asm was generated in Release mode. –  Apr 07 '21 at 11:40
  • I think this doesn't matter. The point is that the just-in-time compiler may generate different assembly, based on whether a debugger is attached or not (independent of whether the C# compiler was running in release or debug). Try whether the code looks different if you attach to the running exe. – PMF Apr 07 '21 at 11:42
  • Updated: Now you can look at the benchmark result in Notes section. –  Apr 07 '21 at 11:47
  • @PMF there is way too much noise to understand what it generated, to be honest. Is there a good way or a technique to look into that? –  Apr 07 '21 at 11:58
  • FWIW, your `G` function above is equivalent to `bool.GetHashCode()`. In other words, the CLR authors didn't see an obvious improvement to what you did there. – 500 - Internal Server Error Apr 07 '21 at 12:02
  • I don't know for sure, but I would first look into the generated IL. If that is as short as possible (in this case probably a CONV.I instruction), the rest is up for the JIT and you typically don't have to worry about it, since it's optimization is typically very good and depends on the actual CPU. – PMF Apr 07 '21 at 12:04
  • @PMF yeah but `IL` is not the true `ASM` which is actually running on my computer. Sure, I could look into that, but I don't think that we could prove there that X is faster than Y. Correct me if I'm wrong. –  Apr 07 '21 at 12:07
  • To really understand which is quicker you have to look at the pipelining of the micro and the width of the memory. If all the microcode is on one row of 64-bit-wide memory, the execution time is probably the same. – jdweng Apr 07 '21 at 12:09
  • @jdweng I'm unfamiliar with that method. Could you maybe write an answer how would one go and measure the true speed of X and Y in C#. –  Apr 07 '21 at 12:13
  • Do both 1 million times and compare the differences. the OPCODEs you are using are 8 bits so there are 8 OPCODEs in a 64 bit wide memory. The one memory word is read by the processor in one read instruction. Then the 8 OP codes are put into the pipeline and executed in parallel. – jdweng Apr 07 '21 at 12:20
  • When manipulating generated IL you can use Fody to clean up. For `if (!value)`, a branch-prediction error costs around 15-20 cycles. There might be some obvious CPU intrinsics. – Drag and Drop Apr 07 '21 at 12:41
  • @DragandDrop The intrinsics I know don't work with `bools` I always have to pick an "int-like" type. –  Apr 07 '21 at 12:47
  • Sure, IL is not ASM, but if you look at the IL, you might get an idea on whether the difference is in the compiler or the JIT, which might be relevant. – PMF Apr 07 '21 at 13:34
  • @jdweng: no, the opcode for `movzx eax, cl` is 2 bytes (https://www.felixcloutier.com/x86/movzx), and the whole insn is 3 bytes long because it also includes a ModRM byte to specify the register (or memory) operands. Instruction-length isn't very significant in performance on modern x86 CPUs; hot code often runs from the uop cache which caches the result of decoding to uops. See https://www.realworldtech.com/sandy-bridge/4/ and [Agner Fog's microarch guide](//agner.org/optimize/). Also, modern Intel pipelines are 4 or 5-wide, not 8, and fetch from I-cache in 16 byte chunks on uop-cache miss. – Peter Cordes Apr 07 '21 at 19:43
  • @canton7: `movzx eax, cl` has *zero* latency on current Intel CPUs ([Can x86's MOV really be "free"? Why can't I reproduce this at all?](//stackoverflow.com/q/44169342)), so the only cost is 1 uop for the front-end, and a data dependency on the input. Branching has the possible upside of using the result now and verifying it was correct when the bool is actually available (branch prediction + speculative exec), but has the huge downside that branch mispredicts stall the pipeline for ~15 cycles, and wastes any work done since the branch. Unless it's *very* predictable, movzx is much better – Peter Cordes Apr 07 '21 at 19:48
  • @canton7: If branching was better in general, C and C++ compilers would do that, but they don't. They use `movzx` when widening a bool to an int, because [the object-representation of a `bool` is guaranteed by the ABI to be `0` or `1`](https://stackoverflow.com/questions/54120862/does-the-c-standard-allow-for-an-uninitialized-bool-to-crash-a-program/54125820#54125820). The asm you get from current C# compilers is insane; even if C# implemented bool as 0 / non-zero (not necessarily 1), I'd expect `xor eax, eax` / `test cl,cl` / `setnz al` to "booleanize" a 0 / non-zero into an int 0 or 1 – Peter Cordes Apr 07 '21 at 19:51
  • @PeterCordes : I know not all the opcodes are one byte. I just wanted to keep the explanation simple. – jdweng Apr 07 '21 at 19:59
  • @PeterCordes Note that the vaguely optimized approaches in .NET use `movzx` as well -- they just have a couple of `mov`s first. Remember that GCC has a lot more time to perform optimizations than the JIT does, so expect more optimal code for things like this. Also remember that the JIT is now 2-tier, and this may get better treatment if it ends up on a hot path. – canton7 Apr 07 '21 at 20:01
  • @jdweng: Your explanation goes *way* beyond "simple" and into "completely wrong" territory. There's a fundamental difference between a conditional branch (control dependency) and a `movzx` (data dependency), and the performance model implied by your claims (like that front-end throughput depends on how many instructions fit in an 8-byte chunk of machine code) are also complete nonsense. – Peter Cordes Apr 07 '21 at 20:03
  • @PeterCordes : I've designed pipelines and analyzed the throughput and compared against actual measured results. – jdweng Apr 07 '21 at 20:06
  • @canton7: I don't actually use C#, I just like playing with asm, so thanks for the info about its 2-stage JIT. Do you know if SharpLab is only showing "first stage" JIT? In a couple of Hrant's recent SO questions, we've certainly seen some really silly code-gen from it, so it would be good to hear that hot code paths might not be running code that bad. (But yeah, JITs in general, including Java Hotspot, have to compile fast and thus don't look for as many peepholes and other optimizations as ahead-of-time compilers.) – Peter Cordes Apr 07 '21 at 20:48
  • @PeterCordes My benchmark executed the same function literally `5000000000` times, and I could still see a difference (see the Notes section). Does the benchmark not prove that it didn't get optimized for hot paths? I also included the benchmarking snippet. Maybe someone can spot a mistake there. –  Apr 07 '21 at 20:58
  • Yuck, any self-respecting C compiler would totally discard that loop body! You call with compile-time constant args (allowing full constant propagation), and you don't even use the results! If there's still a perf difference between versions with that, then yeah it's pretty likely the JIT is leaving that useless code in there. I just updated my answer with a real use-case like `total += H(a==b);` and the asm looks reasonable; that's what I would have tested, or maybe `total += H(a & i == 0)` in a loop to give a compiler something dependent on the loop counter in an interesting way. – Peter Cordes Apr 07 '21 at 21:16

2 Answers


It's not all that surprising that reading 4 bytes from a bool generates code that spills to memory first, then reloads, because that's a weird thing to do.

If you're going to mess around with unsafe pointer-casting for type-punning, surely you should be reading the bool into a same-sized integral type like unsigned char or uint8_t or whatever equivalent C# has, and then casting (or implicitly converting) that narrow type to int. Apparently that's Byte.

using System;
static unsafe int H(bool b) 
{
    return *(Byte*)&b;         
}

asm on Sharplab, and see below for this inlining into a caller for H(a == b).

<Program>$.<<Main>$>g__H|0_0(Boolean)
    L0000: mov eax, ecx
    L0002: ret

So apparently the ABI / calling convention passes narrow args like "bool" sign- or zero-extended to 32-bit already. Or else this is more unsafe than I realize and will actually result in int values that aren't 0 or 1!

We get a movzx-load if we take a pointer-to-bool that's not already in a register:

static unsafe int from_mem(bool *b) 
{
    return *(Byte*)b;
}
<Program>$.<<Main>$>g__from_mem|0_1(Boolean*)
    L0000: movzx eax, byte ptr [rcx]
    L0003: ret

Re: performance benefit

There were some questions raised in comments about which is actually better. (And some nonsensical performance claims about code-size and front-end fetch which I replied to in comments.)

If branching was better in general, C and C++ compilers would do that, but they don't. This is very much a missed optimization in current C# implementations; that branching asm is insane, IMO. Possibly / hopefully that goes away with 2nd-stage JITing for hot code-paths, in which case messing around with unsafe could make things worse. So there is some merit to testing real use-cases.

movzx eax, cl has zero latency on current Intel CPUs (Can x86's MOV really be "free"? Why can't I reproduce this at all?), or 1 cycle latency on AMD. (https://uops.info/ and https://agner.org/optimize/). So the only cost is 1 uop for the front-end, and a data dependency on the input. (i.e. the int value isn't ready for use by later instructions until the bool value is ready, like with normal operations such as +)

Branching has the possible upside of using the result now and verifying it was correct when the bool is actually available (branch prediction + speculative exec break the data dependency), but has the huge downside that branch mispredicts stall the pipeline for ~15 cycles, and wastes any work done since the branch. Unless it's very predictable, movzx is much better.

The most likely case for "very predictable" would be a value that never changes, in which case reading it should be cheap (unless it misses in cache) and out-of-order exec can do that nice and early, which would make movzx good and avoid using up space in the CPU's branch-predictor unnecessarily.

Branching on a bool to create a 0 / 1 is basically using branch prediction to do value-prediction. It's certainly possible that it's a good idea in a few rare cases, but it's not what you want by default.
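To make the contrast concrete, here is a small C sketch (names are mine, not from the answer) of the two conversion strategies discussed. Because a C `bool` already holds 0 or 1, C compilers lower both forms to a single `movzx`; the branchy asm from the C# JIT is the anomaly.

```c
#include <stdbool.h>

/* Branchy-looking source: compilers see that the bool is already 0 or 1
   and emit a branchless movzx anyway. */
static int widen_ternary(bool b) { return b ? 1 : 0; }

/* Plain widening cast: also lowers to a single movzx. */
static int widen_cast(bool b)    { return (int)b; }
```

Comparing the two in a compiler explorer shows identical code-gen, which is exactly the point: the choice between branching and `movzx` is made by the compiler, not by how the source is spelled.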


C and C++ compilers can use movzx when widening a bool to an int because the object-representation of a bool is guaranteed / required by the ABI to be 0 or 1. I assume that's the case in most C# implementations as well, not just a byte with 0 / some-non-zero value that might not be 1.
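A portable way to observe that guarantee in C (a sketch of mine, not the answer's code) is to copy the `bool`'s object representation out with `memcpy` rather than a pointer cast:

```c
#include <stdbool.h>
#include <string.h>

/* Read the single byte backing a C bool. The ABI guarantees the stored
   object representation is exactly 0 or 1, which is what lets compilers
   widen with a plain movzx instead of re-booleanizing. */
static int bool_repr(bool b) {
    unsigned char byte;
    memcpy(&byte, &b, 1);  /* sizeof(bool) == 1 on mainstream ABIs */
    return byte;
}
```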

(But even if you did have an arbitrary non-zero value, the normal way to booleanize it to 0 / 1 is xor eax, eax / test cl,cl / setnz al. i.e. to implement int retval = !!x for an integer byte x.)
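That booleanization idiom in C source form (a sketch; the function name is mine) is the double-negation `!!x`, which compilers lower to the `test` + `setnz` sequence above:

```c
/* Collapse an arbitrary 0 / non-zero byte to exactly 0 or 1,
   i.e. int retval = !!x. Compilers emit xor-zero + test + setnz. */
static int booleanize(unsigned char x) {
    return !!x;  /* 0 stays 0; any non-zero value becomes 1 */
}
```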


Real use-case when inlining:

static int countmatch(int total, int a, int b) {
    //return total + (a==b);   // C
    return total + H(a == b);
}

Sharplab

<Program>$.<<Main>$>g__countmatch|0_2(Int32, Int32, Int32)
    L0000: cmp edx, r8d
    L0003: sete al
    L0006: movzx eax, al
    L0009: add eax, ecx
    L000b: ret

Pretty normal code-gen; what you'd expect from a C compiler, just one missed optimization: it should use xor eax,eax / cmp / sete al to take the movzx zero-extension off the critical path for latency. (AL and EAX being part of the same register means that even on Intel CPUs, mov-elimination doesn't apply.) Clang, GCC, and MSVC do this (https://godbolt.org/z/E9fKhh5K8), although older GCC sometimes has trouble avoiding the movzx in other more complicated cases, perhaps to minimize register pressure.

Sharplab doesn't seem to have AArch64 output to let you see if it can compile to cmp w1, w2 / cinc w0, w0, eq like C compilers do. (As well as conditional-select, ARM64 provides a csinc conditional select-increment, which it uses with the zero-register to build cset (x86 setcc) and cinc (add a FLAG condition).) I wouldn't be too optimistic; I'd guess probably still materializing a bool into a register and adding it.

static int countmatch_safe(int total, int a, int b) {
    return total + Convert.ToInt32(a == b);
}

Without unsafe in C#, the silly code-gen inlines but still materializes a boolean for the add, instead of branching around an inc. This is even worse than if (a==b) total++;, which does compile the way you'd expect.

<Program>$.<<Main>$>g__countmatch_safe|0_3(Int32, Int32, Int32)
    L0000: cmp edx, r8d
    L0003: je short L0009
    L0005: xor eax, eax
    L0007: jmp short L000e
    L0009: mov eax, 1
    L000e: add eax, ecx
    L0010: ret
Peter Cordes

In a system where the conversion could have a measurable impact, the most efficient approach is to not convert at all and to keep trues and falses in the target type (int, byte, etc.) from the start.
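A minimal sketch of that idea in C terms (the function is mine, for illustration): a comparison already yields the 0-or-1 integer directly, so no conversion step ever exists.

```c
/* Keep the flag in the target type from the start: in C the == operator
   itself produces an int 0 or 1, so the count can be updated without any
   bool-to-int conversion. */
static int count_if_equal(int total, int a, int b) {
    return total + (a == b);
}
```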

tymtam
  • You're absolutely right; the problem is I can NOT pass an expression like `a == 1` as my parameter when I have `byte` there. –  Apr 07 '21 at 12:09
  • You can pass `1-a` which gives you `not(a == 1)` – tymtam Apr 07 '21 at 12:13
  • That was just an example. My point is that passing a `bool` there requires a conversion (not hard, but you have to convert). Think about `str == "ABCD"`; that already becomes hard to do with bytes. Am I wrong here? –  Apr 07 '21 at 12:15
  • 2
    If you work with strings, then optimising the cycles spent on converting bool to int just doesn't have a measurable impact. – tymtam Apr 07 '21 at 12:21