Are C++11+ compilers smart enough to optimize inner "if" statements?

Question

This is best described with code:

void foo(vector<vector<int>>& a, vector<vector<int>>& b, bool flag) { 

    vector<vector<int>> c; 
    for (int i ...) { 
        for (int j ...) { 
            int value; 
            if (flag) 
                value = a[i][j] + b[i][j]; 
            else 
                value = a[i][j] - b[i][j];
        } 

    } 
}

At face value, the flag is evaluated and branched on every single inner loop, despite being known before either loop. Will a C++11+ compiler generate two separate code paths, evaluating the branch at the beginning, or should this be manually done?

Before lecturing me on premature optimization, please understand this is asked in an effort to be a more cognizant programmer regarding minor details.

why not try it yourself? gcc optimises your code to nothing so will be very fast: https://godbolt.org/z/baPjYv — Alan Birtles, Mar 05 '21 at 08:24
The answer to pretty much every question of this kind is "yes". — Piotr Siupa, Mar 05 '21 at 08:28
@Evg: I made sure I understood that this question is asked in an effort to be a more cognisant programmer regarding minor details before posting up that link. — Bathsheba, Mar 05 '21 at 09:14
the most important thing to realize is that there is no one answer that fits all. Details do matter. As Alan mentioned, your code as posted is effectively not doing anything and a compiler is able to see that. Details do matter and if you care about performance there is no alternative to try and see what the compiler produces as output and to measure — 463035818_is_not_an_ai, Mar 05 '21 at 10:21
I would like to add that even if the compiler doesn't optimize that, the performance penalty will be very small. Because [branch prediction](https://stackoverflow.com/a/11227902) done by the CPU will be extremely good is that case, as the condition is constant throughout the double loops. — prapin, Mar 05 '21 at 21:16

score 1 · Answer 1 · answered Mar 05 '21 at 08:40

This probably depends on complexity of your example, but compilers are capable of that kind of optimization. Let's have look at a simple and complete example:

extern bool get_bool() noexcept;
extern int get_int() noexcept;
extern void foo1() noexcept;
extern void foo0() noexcept;

void foo() noexcept {
  bool b = get_bool();
  int i_mx = get_int();
  int j_mx = get_int();

  for (int i = 0; i < i_mx; ++i) {
    for (int j = 0; j < j_mx; ++j) {
      if (b)
        foo1();
      else
        foo0();
    }
  }
}

If we compile this with clang, here is the generated code:

foo():                                # @foo()
        push    rbp
        push    r15
        push    r14
        push    r12
        push    rbx
        call    get_bool()
        mov     r14d, eax
        call    get_int()
        mov     r15d, eax
        call    get_int()
        test    r15d, r15d
        jle     .LBB0_9
        mov     r12d, eax
        test    eax, eax
        jle     .LBB0_9
        xor     ebx, ebx
        test    r14b, r14b
        je      .LBB0_3
.LBB0_6:                                # =>This Loop Header: Depth=1
        mov     ebp, r12d
.LBB0_7:                                #   Parent Loop BB0_6 Depth=1
        call    foo1()
        dec     ebp
        jne     .LBB0_7
        inc     ebx
        cmp     ebx, r15d
        jne     .LBB0_6
        jmp     .LBB0_9
.LBB0_3:                                # =>This Loop Header: Depth=1
        mov     ebp, r12d
.LBB0_4:                                #   Parent Loop BB0_3 Depth=1
        call    foo0()
        dec     ebp
        jne     .LBB0_4
        inc     ebx
        cmp     ebx, r15d
        jne     .LBB0_3
.LBB0_9:
        pop     rbx
        pop     r12
        pop     r14
        pop     r15
        pop     rbp
        ret

It's clear that test r14b, r14b line is moved outside the loops. Again, your mileage may vary depending on the complexity of your code. Better check the generated assembly to be sure.

score 0 · Accepted Answer · answered Mar 05 '21 at 10:12

There is little in the way of optimization in even the newest C++ standard today (2021), e.g. copy elision and return value optimization.

This leaves 'compilers' to apply any platform-specific optimizations.

The function in the question produces no effect and is thus most likely optimized away completely.

But to address (what I am assuming is) the 'underlying' question, a typical compiler will be able to infer that the same condition is applied even to a nested loop, e.g.

int foo(vector<vector<int>> a, vector<vector<int>> b, bool flag) { 
    int value = 0;
    for (int i = 0; i < a.size(); i++) { 
        for (int j = 0; j < a[i].size(); j++) { 
            if (flag) 
                value += a[i][j] + b[i][j]; 
            else 
                value += a[i][j] - b[i][j];
        } 

    } 
    return value;
}

Assembly code below says 'yes, it is optimized' (generated by Clang Intel x86 64-bit compiler):

notice third argument foo found in register dl (an 8-bit version of 64-bit register RDX) is tested before the two loops start
both loops have been duplicated based on the 'foo' condition: LBB1_2 and LBB1_6

This (redacted) assembly code as generated by running: g++ -std=c++17 -O3 -c -S code.cpp:

__Z3fooNSt3__16vectorINS0_IiNS_9allocatorIiEEEENS1_IS3_EEEES5_b: ## @_Z3fooNSt3__16vectorINS0_IiNS_9allocatorIiEEEENS1_IS3_EEEES5_b
        .cfi_startproc
## %bb.0:
        pushq   %rbp
...
        xorl    %eax, %eax    # <======== int value = 0;
        testb   %dl, %dl      # <======== if (flag)
        jne     LBB1_6
        jmp     LBB1_2
        .p2align        4, 0x90
LBB1_7:                                 ##   in Loop: Header=BB1_6 Depth=1
        incq    %r10
        cmpq    %r10, %r8
        jbe     LBB1_26
LBB1_6:                                 ## =>This Loop Header: Depth=1
                                        ##     Child Loop BB1_13 Depth 2
                                        ##     Child Loop BB1_17 Depth 2
        leaq    (%r10,%r10,2), %rcx
        movq    (%r9,%rcx,8), %rdi
        movq    8(%r9,%rcx,8), %r11
        subq    %rdi, %r11
...
LBB1_2:                                 ## =>This Loop Header: Depth=1
                                        ##     Child Loop BB1_21 Depth 2
                                        ##     Child Loop BB1_5 Depth 2
        leaq    (%r10,%r10,2), %rcx
        movq    (%r9,%rcx,8), %rdi
        movq    8(%r9,%rcx,8), %r11
        subq    %rdi, %r11

score 0 · Answer 3 · answered Mar 05 '21 at 11:35

optimization of this stuff will look like this

#include <iostream>
#include <vector>
#include <execution>


int main(int argc , char *argv[])
{
    std::vector<std::vector<int>> b{{2,3} ,{4,7}};
    std::vector<std::vector<int>> aq{{7,8} ,{2,17}};

    std::for_each(std::execution::par, b.begin(), b.end() ,[&](auto& x){
                    for (auto& w : x)
                        w = aq[&x - b.data()][&w - x.data()];
                });
    std::cout << "and then " << b[0][1] << std::endl;
}

-ltbb -std=c++20

Are C++11+ compilers smart enough to optimize inner "if" statements?

3 Answers3