
I know that the topic of short-circuit operators has been discussed a lot here on Stack Exchange. But with respect to the C language specifically, there is an aspect that has bothered me for quite a while.

Consider some C code like

if (b && f(&s)) {...}

where b is of boolean type, s is a variable whose type is some more or less elaborate struct, and f is a getter for that struct. A more concrete example might look like:

if (active && isEmpty(&list)) {...}

Usually, I would expect that any decent compiler will inline the isEmpty function and optimize the whole expression down to a single AND assembly instruction, with the first operand already held in a data register and the second operand in memory, accessed via an 'address register with offset' addressing mode (or something similar).
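
For illustration, this is the kind of transformation I have in mind (the struct layout and the body of isEmpty are made up for this sketch):

#include <stdbool.h>

struct node;
struct list { struct node *head; };

/* hypothetical getter: after inlining, just one memory load */
static inline bool isEmpty(const struct list *l)
{
    return l->head == 0;
}

/* what I would expect the optimizer to produce, in effect:
       if (active & (list.head == 0)) { ... }
   i.e. a single AND with 'active' in a register and the second
   operand loaded from memory, with no conditional branch */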

However, since the compiler is forced (by the language rules) to generate the short-circuit form, this kind of optimization must not be performed, which would ultimately produce drastically inefficient code. It is hard for me to believe that this is actually happening!

What version of assembler code will an optimizing C compiler usually generate?

  • Nothing prevents the compiler from inlining the right-hand side anyway. And think about the case where the right-hand side is an expensive operation: would you still want it to happen even if the left-hand side is false? Or the common case of checking for null pointers, where the check is on the left-hand side and the dereference of the pointer on the right-hand side; without short-circuit evaluation that would not be possible. – Some programmer dude May 04 '22 at 06:34
  • Well, it depends on your compiler, and you haven't tagged which one you are interested in. With gcc, use -S and the various optimization levels to see what happens. – Allan Wind May 04 '22 at 06:34
  • The compiler isn't forced to generate short-circuit code. It has to generate code that _acts as if_ it did short-circuit evaluation. If it can prove that evaluating the second part is always well defined and has no side effects, for instance, it could do exactly what you describe. – Mat May 04 '22 at 06:35
  • "Drastically inefficient code" is a weird way to spell "avoids over half the work whenever it can, and pays the cost of a single predictable branch (e.g. very little) in cases where the first test reliably passes". Even at the worst point in overly deep pipelines (Pentium 4 era), the cost of a predictable branch was quite cheap. – ShadowRanger May 04 '22 at 07:06
  • Beyond that, if you *want* to avoid the extra branch, you can always write `if (active & isEmpty(&list)) {...}` and get your desired behavior. Why make `if (cheaptest_that_false_90pct_of_the_time && expensivetest())` take forever just to avoid a branch that, as noted, is not nearly as bad as you seem to think it is? – ShadowRanger May 04 '22 at 07:10
  • Side-note: Short-circuiting is 50+ years old. "boolean" types are less than 25 years old. 25 years ago, a "boolean" was an `int` and could easily have a value that was neither `0` nor `1`, and `if (a & b)` could be false even if both `a` and `b` evaluated to true. So no, it can't just replace `&&` with `&` (see the sketch after these comments). – ShadowRanger May 04 '22 at 07:13
  • https://stackoverflow.com/questions/26124620/why-does-msvc-emit-a-useless-movsx-before-performing-this-bit-test – Hans Passant May 04 '22 at 07:57
  • @Someprogrammerdude *Would you still want it to happen even if the left-hand side is false?* No, of course not, quite the opposite: often, the correctness of the code *relies* on the fact that the right-hand side is not evaluated (as in the null-pointer example). But using the short circuit to avoid an expensive operation is bad design IMHO. – Hartmut Braun May 04 '22 at 10:26
  • @AllanWind I disagree: this is the C language standard and thus independent of the compiler in use. – Hartmut Braun May 04 '22 at 10:27
  • @Mat the problem is not side effects; that is something the optimizer can figure out. The problem is that the correctness of evaluating the right-hand side depends on the left-hand side, like when checking a pointer for null before dereferencing it. This is a problem that would require understanding the semantics of my code and still exceeds the capabilities of optimizers, doesn't it? – Hartmut Braun May 04 '22 at 10:32
  • @ShadowRanger No, that's not the point of my question. My problem is: how can the compiler produce efficient code for simple expressions that are not expensive? As already stated in another comment: using short-circuit to avoid expensive operations is bad design IMHO. – Hartmut Braun May 04 '22 at 10:33
  • @HartmutBraun: no, optimizers are quite capable of analyzing that - not in all cases of course, but in practical cases they certainly can. If you haven't watched it already, look for the "What has my compiler done for me lately?" talk by Matt Godbolt. – Mat May 04 '22 at 10:34
  • @HartmutBraun What exactly do you disagree with? Short-circuiting is well-defined behavior, but I don't recall the standard talking about inlining in this case. If you have a section reference I would be happy to check it out. – Allan Wind May 04 '22 at 19:18
  • @AllanWind I disagree with your claim that an answer to my question requires naming a specific compiler. My question is about the consequences of the required behaviour of short-circuit operators as specified in the language reference manuals. On that ground, Eric gave a perfectly valid answer and I accepted it. – Hartmut Braun May 05 '22 at 09:18
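
A minimal sketch of the pitfall described in ShadowRanger's comment above (values made up): with non-normalized truth values, & and && disagree.

#include <stdio.h>

int main(void)
{
    int a = 1, b = 2;                    /* both logically "true" */

    if (a && b)
        puts("&& treats both as true");  /* prints                */
    if (a & b)
        puts("& treats both as true");   /* 1 & 2 == 0, so this
                                            line never prints     */
    return 0;
}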

2 Answers


Yes, it is slightly inefficient, but not because of inlining; rather, because instruction re-ordering isn't possible. In case the left operand is the most execution-heavy one, it still has to be executed first.

For example, consider search_algorithm() && bool_flag: the slow function must always be executed, even though the bool flag check is much quicker and, if it evaluated to zero, the whole expression would be false. The compiler can't do anything about that; we are forced to rely on manual optimization by swapping the operands.
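
A sketch of that manual optimization (using the hypothetical functions from above; swapping is only valid when search_algorithm has no side effects the rest of the code relies on):

/* as written: the slow search always runs */
if (search_algorithm() && bool_flag) { /* ... */ }

/* manually swapped: the cheap flag test gates the slow call */
if (bool_flag && search_algorithm()) { /* ... */ }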

Inlining may happen regardless of where the function call happens in an expression. It's not the inlining in itself that's blocked, it's the execution of the code in the function.

The &&/|| operators could potentially also mean that an extra branch gets introduced in the machine code, since if(a && b) is equivalent to if(a) { if(b) { ... } }.


Inlining example for x86:

#include <stdio.h>

/* prints its own name as a side effect, then returns its argument */
int afunc (int a)
{
  puts(__func__);
  return a;
}

/* same side effect, making the evaluation order observable */
int bfunc (int b)
{
  puts(__func__);
  return b;
}

_Bool and (int a, int b)
{
  return afunc(a) && bfunc(b);
}

gcc -O3 generates this code:

afunc:
        push    r12
        mov     r12d, edi
        mov     edi, OFFSET FLAT:__func__.1
        call    puts
        mov     eax, r12d
        pop     r12
        ret
bfunc:
        push    r12
        mov     r12d, edi
        mov     edi, OFFSET FLAT:__func__.0
        call    puts
        mov     eax, r12d
        pop     r12
        ret
and:
        push    rbp
        mov     ebp, esi
        push    rbx
        mov     ebx, edi
        mov     edi, OFFSET FLAT:__func__.1
        sub     rsp, 8
        call    puts
        xor     eax, eax
        test    ebx, ebx
        jne     .L12
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
.L12:
        mov     edi, OFFSET FLAT:__func__.0
        call    puts
        test    ebp, ebp
        setne   al
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret
__func__.0:
        .string "bfunc"
__func__.1:
        .string "afunc"

The relevant part is the and: function:

  • It inlines the function call by directly loading the string address OFFSET FLAT:__func__.1 (corresponding to "afunc") into a register and then calling puts.
  • It checks the value of a with test ebx, ebx and then introduces the extra branch jne .L12 based on the outcome, .L12 being the evaluation of the right operand.

Both functions are inlined. They are executed left to right, and the right operand is only evaluated when "jump if not equal to zero" (jne) takes the jump; otherwise the function returns. clang and icc give nearly identical machine code.
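
As discussed in the comments below, the extra branch can be traded away by hand with a non-short-circuiting variant. A sketch (note that it is not equivalent: both operands may get evaluated):

_Bool and_fast (int a, int b)
{
    /* bitwise & on normalized _Bool values: no sequence point
       between the operands, so no extra branch is needed, but
       bfunc(b) may run even when afunc(a) returns 0 */
    return (_Bool)afunc(a) & (_Bool)bfunc(b);
}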

Lundin
  • Yes, I understand all you say, but that's not exactly the point. You say that the operators could *potentially* introduce an extra branch in machine code. But to me it seems that they are almost *always* forced to do so by the language standard, even in moderately complex expressions like the example in my question. In addition, I'm inclined to assume that some compilers may still optimize it, violating the language requirements. – Hartmut Braun May 04 '22 at 10:50
  • @HartmutBraun "I'm inclined to assume that some compilers may still optimize it" Sounds like a conspiracy theory. I've never seen it, you'll need to provide evidence. On the contrary, in my experience introducing sequence points in C code always tends to result in different machine code generated. – Lundin May 04 '22 at 13:50
  • No conspiracy. Compiler builders are humans. Compilers produce loads of errors, in particular while optimising. I work in safety-critical software and we need to be aware of the problems our compilers have. That's why I'm interested in this topic. – Hartmut Braun May 04 '22 at 14:02
  • @HartmutBraun I work in safety-critical software too and I have never observed the behavior you speculate about. Not even in various embedded systems compilers of diverse quality. The sequence point in `&&` has been there since the first C standard was released. 33 years later you claim that compilers are suddenly non-compliant. So it's a conspiracy theory until you have shown actual proof in the form of disassembly. – Lundin May 04 '22 at 14:09
  • @HartmutBraun I added an example where inlining and evaluation order are both carried out as expected. – Lundin May 04 '22 at 14:22
  • I think you misunderstood my comment on errors in compilers (you even seem to be offended...?). Compilers are software; they *do* have errors. Some of them lead to erroneous machine code; I analyzed 12 problem reports for an Ada compiler myself. These errors are rare and only occur for very specific combinations of specific code constructs in specific contexts. https://stackoverflow.com/questions/2722302/can-compiler-optimization-introduce-bugs – Hartmut Braun May 04 '22 at 16:17
  • @HartmutBraun Yes they have plenty of errors, as seen from SO. We find some ~20-30 bugs in gcc or clang per year. However, this is very mature and peer reviewed software by now, any bugs encountered are almost always about special features, recent changes etc. The `&&` operator is not an exotic feature, it's very fundamental and well-known C. But claiming that something is broken with no proof and pure speculation/opinions is simply not how engineering works. – Lundin May 05 '22 at 06:32
  • Ok, ok, so be it: I retract my *inclination* to claim that… . This argument is off-topic anyway w.r.t. my question. So, back to the topic: based on your example (thanks again) and the answer by Eric (see my comment on it), it seems to me that indeed the short-circuit operators result in machine code that is not as good as it could get, at least to some extent. – Hartmut Braun May 05 '22 at 09:09
  • @HartmutBraun You can create such a "non-short-circuit AND operator" yourself: `(_Bool)expr1 & (_Bool)expr2`. Now the order of evaluation is no longer well-defined and the compiler can optimize or sequence the expressions as it pleases. Check out this modified example https://godbolt.org/z/r69hMqG33. The `and_fast` doesn't come with the extra branch. You'll need to make a good argument before rolling out these kind of micro-optimizations though, as they make the code harder to read and understand. Also this functionality isn't identical since the `and_fast` may evaluate both expressions. – Lundin May 05 '22 at 09:51
  • @Lundin: Also note: As your own GodBolt link shows, x86 does more work than you'd think to coerce to normalized booleans (with a `test`+`setne` for each) and perform the `and` test; it's not branching, so if the branch predictor gets it wrong regularly (it's usually ~90% correct), branchless will win, but short-circuiting means a predicted branch in `&&` code executes either 14 instructions (lower cost than branchless) or 18 instructions (similar cost to branchless) depending on the first test, while the `(_Bool)expr1 & (_Bool)expr2` code always executes 18 instructions. – ShadowRanger May 17 '22 at 13:19

What version of assembler code will an optimizing C compiler usually generate?

This is an evolving question, and compilers have gotten better and better over the years. Good programmers become familiar with the capabilities of their compilers and write code that takes advantage of it. Thus, if active && isEmpty(&list) versus isEmpty(&list) && active is an issue, a good programmer will choose the one that is better for their situation, and/or they may dive into the implementation of isEmpty and see if it can be improved for this.

However, since the compiler is forced (by the language rules) to generate the short-circuit form,…

This is true and not true. The C standard specifies behavior on two levels. The semantics of C are specified using a model of an abstract computer that explicitly does what the standard says. For &&, it says “… the && operator guarantees left-to-right evaluation…,” so it is true that is what the abstract computer does in the model.

But a C implementation does not have to implement that model literally. On the “what is really implemented” level, instead of on the model level, a C implementation only has to match the observable behavior of the model, which is (C 2018 5.1.2.3 6):

  • Accesses to volatile objects are evaluated strictly according to the rules of the abstract machine.
  • At program termination, all data written into files shall be identical to the result that execution of the program according to the abstract semantics would have produced.
  • The input and output dynamics of interactive devices shall take place as specified in 7.21.3. The intent of these requirements is that unbuffered or line-buffered output appear as soon as possible, to ensure that prompting messages actually appear prior to a program waiting for input.
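
For example, a volatile access is observable behavior and pins the evaluation down. A minimal sketch, where active_hw is a made-up memory-mapped flag:

#include <stdbool.h>

volatile bool active_hw;   /* made-up memory-mapped status flag */

bool poll(bool b)
{
    /* the read of active_hw is an access to a volatile object:
       it must happen exactly once, in the abstract machine's
       order, and cannot be optimized away or merged */
    return active_hw && b;
}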

Now suppose the C compiler can see the implementation of isEmpty, as it would when inlining it. In this case, it can see all the effects of active && isEmpty(&list). If active && isEmpty(&list) contains no accesses to volatile objects and no writing to files or interacting with devices, then the actual implementation of the expression can be in any order at all, including any ordering of any subparts inside isEmpty, provided the reordering computes the correct result. So, in the actual implementation, it is not true that the compiler must guarantee the left-to-right ordering; it merely needs to guarantee the same result.
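
For example, when the struct is a local object (as in the question), the inlined load cannot trap, so the compiler is free to evaluate both operands and combine them without a branch. A sketch; whether a given compiler actually does this varies:

#include <stdbool.h>

struct node;
struct list { struct node *head; };

static inline bool isEmpty(const struct list *l)
{
    return l->head == 0;
}

bool check(bool active, struct list list)  /* by value: always valid */
{
    /* reading list.head is safe even when active is false, so the
       as-if rule allows computing both tests and emitting a single
       branchless AND of the results */
    return active && isEmpty(&list);
}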

That correct result is an issue. Maybe active guarantees that evaluation of isEmpty(&list) will not use some null pointer, so active needs to be evaluated “first.” However, if the compiler is inlining isEmpty, it could evaluate parts of isEmpty first, as long as it ensures none of its evaluations cause a problem if active is false.
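
Conversely, here is a sketch where the guard matters (hypothetical layout: head is only valid while active is true):

#include <stdbool.h>

struct node { int count; };
struct list { struct node *head; };  /* head may be NULL when idle */

static inline bool isEmpty(const struct list *l)
{
    return l->head->count == 0;      /* dereferences head */
}

/* hypothetical caller contract: head != NULL whenever active */
bool check(bool active, const struct list *list)
{
    /* parts of isEmpty may be evaluated ahead of the 'active' test
       only if the compiler can prove they are safe when active is
       false; the head dereference blocks that unless head is
       provably non-null */
    return active && isEmpty(list);
}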

One thing that can block these sorts of optimizations is that isEmpty might use other functions or objects that the compiler does not have a complete view of, so the compiler is forced to implement the abstract computer more literally in certain respects. This goes back to the good programmer issue—if this part of the program matters, a good programmer will consider their interactions here between the code for isEmpty and what the compiler capabilities are. And that will likely change over the coming years.

Eric Postpischil
  • Great. The important keyword here is *volatile*. I take this to mean that for simple expressions probably no branches will be generated. Looking at the example provided by @Lundin, it seems to me that the existence of a "non-short-circuit" operator could have provided valuable information (about the volatility) that would have allowed the optimiser to collapse the "and" function almost completely. Instead, even with -O3, it generates branches. – Hartmut Braun May 05 '22 at 07:08