
I recently heard of the idea of branchless programming and I want to give it a try and see if it can boost performance. I have the following C function.

int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

After removing the if branch, I have this:

int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

Now I want to know whether the branchless version is faster. I searched around and found a tool called hyperfine (https://github.com/sharkdp/hyperfine), so I wrote the following main function and tested the two versions of the square function with hyperfine.

#include <stdio.h>

int main() {
    printf("%d\n", square(38));
    return 0;
}

The problem is that based on the hyperfine results, I can't determine which version is better. In C programming, how do people usually determine which version of a function is faster?

Below is some of my hyperfine result.

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.4 ms ±   0.2 ms    [User: 2.2 ms, System: 3.2 ms]
  Range (min … max):     4.9 ms …   6.1 ms    230 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       6.1 ms ±   0.7 ms    [User: 2.2 ms, System: 3.7 ms]
  Range (min … max):     5.0 ms …   9.7 ms    225 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.5 ms ±   0.3 ms    [User: 2.1 ms, System: 3.5 ms]
  Range (min … max):     4.9 ms …   7.0 ms    211 runs

C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.6 ms ±   0.4 ms    [User: 2.0 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   7.0 ms    217 runs

  Warning: Command took less than 5 ms to complete. Results might be inaccurate.


C:\my_projects\untitled>hyperfine branch.exe
Benchmark #1: branch.exe
  Time (mean ± σ):       5.7 ms ±   0.3 ms    [User: 1.9 ms, System: 4.0 ms]
  Range (min … max):     5.0 ms …   6.6 ms    220 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.6 ms ±   0.3 ms    [User: 1.9 ms, System: 3.9 ms]
  Range (min … max):     4.8 ms …   6.9 ms    219 runs

C:\my_projects\untitled>hyperfine branchless.exe
Benchmark #1: branchless.exe
  Time (mean ± σ):       5.8 ms ±   0.3 ms    [User: 1.5 ms, System: 4.0 ms]
  Range (min … max):     5.2 ms …   7.3 ms    224 runs

C:\my_projects\untitled>
  • `how do people usually determine which version of a function is faster?` In such simple cases, look at the generated assembly. Note that you are not benchmarking code alone, but benchmarking a combination of compiler+compiler options+code. – KamilCuk Aug 05 '20 at 10:09
  • Take a look at the assembly; there you can see how many instructions the two code examples take. For this minimal code I think (depending on the platform you are working on) every assembler instruction will take one clock cycle. – sunriax Aug 05 '20 at 10:10
  • You are measuring the performance of repeated executions of `square(38)`. That is pointless, as a program that would need the value of `square(38)` might just as well use `1444`, which may be a lot faster than either of your functions (depending on compiler optimizations). You need to measure with a distribution of values, similar to values you might see in the program that needs this function, and in a similar order (to take into account the effects of branch prediction); see the sketch after these comments. – Kris Vandermotten Aug 05 '20 at 10:14
  • Secondly, it seems you're measuring the execution time of `printf` as well, which will most likely be orders of magnitude higher than the execution time of your function. That will hide the effects of any difference between the functions. It's like trying to see the light of a distant star next to the sun: it's invisible. – Kris Vandermotten Aug 05 '20 at 10:22
  • Why not just do `int tmp = num > 10; return num * num * tmp;`? This is simpler and probably faster. Moreover, conditional jumps are slow only when they are hard to predict on modern processors. However, as `num > 10` is always true in computing `square(38)`, the version with branches should be fast. – Jérôme Richard Aug 05 '20 at 10:24
  • `result += (-(num > 10)) & num;` is also branch-less and will produce the same results. – Roy Avidan Aug 05 '20 at 10:37
  • *how do people usually determine which version of a function is faster?* **You don't bother**. You'd be wasting time. You write clear, maintainable code. First, you are not likely to outsmart your compiler's optimizations. Second, what you think looks faster has a really good chance of actually being **slower**, oftentimes because the compiler couldn't figure out how to optimize obtuse code. Third, what you think are the lines of code that are going to be bottlenecks **probably aren't**. If your final product doesn't meet requirements, then you **profile it** and fix the actual bottlenecks. – Andrew Henle Aug 05 '20 at 11:13
  • generally speaking gcc is not good at this kind of optimisation. clang does it better: https://godbolt.org/z/oqojcx – 0___________ Aug 05 '20 at 12:15
  • @KrisVandermotten: Even more importantly, they're measuring the total time for an entire process to start up and exit!! That will include a page fault or two normally, and dynamic linking. Using `printf` is probably a significant part of that, maybe especially on Windows in a slow terminal window, but yeah, totally insane. Not quite a duplicate of the more generic [Idiomatic way of performance evaluation?](//stackoverflow.com/q/60291987), but that points out several of the methodology problems, and the fundamental flaw in trying to find a simple 1-dimensional cost for something this short. – Peter Cordes Aug 05 '20 at 16:05
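
A minimal sketch of the approach the comments above suggest: time each version over inputs on both sides of the threshold, in a pseudo-random order, so branch prediction is actually exercised. The function names, input range, and iteration counts here are arbitrary illustrative choices, and an optimizing compiler may still transform either function, so inspect the generated asm as well.

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

/* The two versions from the question, renamed so both fit in one file. */
int square_branch(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

int square_branchless(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

volatile int sink;  /* keeps the calls from being optimized away entirely */

int main(void) {
    enum { N = 1 << 20, REPS = 100 };
    static int inputs[N];

    /* Values on both sides of the threshold, in pseudo-random order,
       so the branch is hard to predict. */
    srand(42);
    for (int i = 0; i < N; i++)
        inputs[i] = rand() % 40;

    clock_t t0 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            sink += square_branch(inputs[i]);
    clock_t t1 = clock();
    for (int r = 0; r < REPS; r++)
        for (int i = 0; i < N; i++)
            sink += square_branchless(inputs[i]);
    clock_t t2 = clock();

    printf("branch:     %.3f s\n", (double)(t1 - t0) / CLOCKS_PER_SEC);
    printf("branchless: %.3f s\n", (double)(t2 - t1) / CLOCKS_PER_SEC);
    return 0;
}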

2 Answers


How to benchmark a few lines of C programming code?

Compile the code and inspect the assembly generated by your compiler.

It's typical to use godbolt and inspect the generated assembly there. Godbolt link.

A semi-not-really-reliable way is to count the assembly instructions executed. I do not know about Windows - I work on Linux. With gdb I use the code presented in this question, together with the following:

// 1.c
#if MACRO
int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}
#else
int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}
#endif
// start-stop places for counting assembly instructions
// The attribute and the asm statement are GNU extensions,
// used so that the compiler will not optimize the functions out
__attribute__((__noinline__)) void begin() { __asm__("nop"); }
__attribute__((__noinline__)) void finish() { __asm__("nop"); }
// volatile, so that the compiler
// won't optimize the call to square() completely out
volatile int arg = 38, res;
int main() {
    begin();
    res = square(arg);
    finish();
}

Then compile and benchmark in bash:

# a short function to count the number of instructions executed between the "begin" and "finish" functions
$ b() { printf "%s\n" 'set editing off' 'set prompt' 'set confirm off' 'set pagination off' 'b begin' 'r' 'set $count=0' 'while ($pc != finish)' 'stepi' 'set $count=$count+1' 'end' 'printf "The count of instruction between begin and finish is: %d\n", $count' 'q' | gdb "$1" |& grep 'The count'; }

# then compile and measure
$ gcc -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 34
$ gcc -D MACRO=1 1.c ; b a.out
The count of instruction between begin and finish is: 22

Looks like on my platform with the gcc10 compiler, without any options and without optimizations, the version with the branch executes 12 fewer instructions. But comparing compiler output without optimizations makes little sense. After enabling optimizations there is a difference of one instruction:

$ gcc -O -D MACRO=0 1.c ; b a.out
The count of instruction between begin and finish is: 11
$ gcc -O -D MACRO=1 1.c ; b a.out 
The count of instruction between begin and finish is: 10

Notes:

  • with your code, square(38) can just be optimized to a no-op.
  • with your code and hyperfine branchless.exe you are measuring the execution of printf, i.e. the time it takes to print and flush the output, not the execution of square().
  • As presented in that answer, you can use a hardware counter when available; a minimal sketch follows.
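
For example, on x86 with GCC or Clang, a sketch using the __rdtsc() intrinsic from <x86intrin.h> (the repeat count is an arbitrary choice, and the measurement includes the rdtsc overhead itself, so treat the number as a rough lower bound):

#include <stdio.h>
#include <x86intrin.h>

__attribute__((__noinline__)) int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

volatile int arg = 38, res;

int main(void) {
    unsigned long long best = ~0ULL;
    /* Repeat many times and keep the minimum, to filter out
       interrupts and warm-up effects. */
    for (int i = 0; i < 100000; i++) {
        unsigned long long t0 = __rdtsc();
        res = square(arg);
        unsigned long long t1 = __rdtsc();
        if (t1 - t0 < best)
            best = t1 - t0;
    }
    printf("~%llu cycles (including rdtsc overhead)\n", best);
    return 0;
}
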
  • Please don't say "12 instructions faster". Say "12 instructions shorter", and leave the extrapolation to "faster" to the reader if they want to make that leap. It's *often* true that without memory bottlenecks, fewer instructions executed = faster because of bottlenecks on the front-end, but really performance on this small a scale has 3 dimensions: front-end uops, latency from inputs to outputs, and which back-end execution-ports it needs. Most instructions are 1 uop and similar speed, but exceptions include `div`, and `imul` has 3 cycle latency. – Peter Cordes Aug 05 '20 at 15:56
  • See [What considerations go into predicting latency for operations on modern superscalar processors and how can I calculate them by hand?](https://stackoverflow.com/q/51607391) for more detail. Also [How many CPU cycles are needed for each assembly instruction?](https://stackoverflow.com/a/44980899) - as I said, that question doesn't have an answer. You can't assign a single number to each instruction and add up costs; superscalar out-of-order execution doesn't work that way. You have to figure out which of front-end, latency, or a single execution port are the bottleneck for a given loop – Peter Cordes Aug 05 '20 at 15:56

As said in the comments, the execution time of printf is greater than the time you want to measure, and independently of that, the time you are trying to measure is too small.

To get a real measurement you have to put square in one file and its call, in a loop, in another file, and avoid literal arguments; otherwise the generated code can be reduced directly to the result and nothing more (never underestimate the power of the optimizations compilers are able to do when they know everything, e.g. C++ constexpr).

So for instance :

file c1.c

int square(int num) {
    int result = 0;
    if (num > 10) {
        result += num;
    }
    return result * result;
}

file c2.c

int square(int num) {
    int result = 0;
    int tmp = num > 10;
    result = result * tmp + num * tmp + result * !tmp;
    return result * result;
}

file main.c

#include <stdio.h>

extern int square(int);

int main(int argc, char ** argv)
{
  int n, v, r = 0;
  
  if ((argc == 3) && 
      (sscanf(argv[1], "%d", &n) == 1) &&
      (sscanf(argv[2], "%d", &v) == 1))
    while (n--)
      r += square(v);
  return r;
}

Using the first solution (without optimization):

/tmp % gcc c1.c main.c 
/tmp % time ./a.out 1000000000 38
2.315u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
2.316u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
2.316u 0.000s 0:02.41 95.8% 0+0k 0+0io 0pf+0w
/tmp %

Using the second solution (without optimization):

/tmp % gcc c2.c main.c 
/tmp % time ./a.out 1000000000 38
3.087u 0.000s 0:03.21 95.9% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
3.107u 0.000s 0:03.23 95.9% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
3.098u 0.000s 0:03.22 95.9% 0+0k 0+0io 0pf+0w
/tmp  %

So without optimization the second version needs more time. This is still the case with optimization, even though the difference between them becomes almost nil:

/tmp % gcc -O2 c1.c main.c
/tmp % time ./a.out 1000000000 38
1.337u 0.000s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.336u 0.001s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.343u 0.000s 0:01.39 96.4% 0+0k 0+0io 0pf+0w
/tmp % 
/tmp % 
/tmp % gcc -O2 c2.c main.c
/tmp % time ./a.out 1000000000 38
1.341u 0.000s 0:01.39 96.4% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.343u 0.000s 0:01.40 95.7% 0+0k 0+0io 0pf+0w
/tmp % time ./a.out 1000000000 38
1.339u 0.000s 0:01.39 95.6% 0+0k 0+0io 0pf+0w
/tmp %

I did this under Linux, but you can do the same under Windows using your measurement tool.

For information, the generated code with optimization is:

first way:

square:
.LFB0:
    .cfi_startproc
    movl    %edi, %edx
    xorl    %eax, %eax
    imull   %edi, %edx
    cmpl    $11, %edi
    cmovge  %edx, %eax
    ret

second way:

square:
.LFB0:
    .cfi_startproc
    xorl    %eax, %eax
    cmpl    $10, %edi
    setg    %al
    imull   %edi, %eax
    imull   %eax, %eax
    ret
    .cfi_endproc
  • measuring the performance without optimisations is rather pointless – 0___________ Aug 05 '20 at 12:02
  • @P__J__ for sure, I did it to show the difference between with and without, and to show that optimizations make the results less different than the sources are – bruno Aug 05 '20 at 12:11
  • The key point you could be making in your answer is that GCC optimized the `if` into branchless asm with `cmov`, which is more efficient than `setg`+`imul`. This is called "if-conversion". It happens most reliably with a ternary operator in the C source (see the sketch after these comments). – Peter Cordes Aug 05 '20 at 16:00
  • @PeterCordes sure this is interesting, but quite off topic; may I remind you that the question is how to benchmark a short piece of code? – bruno Aug 05 '20 at 17:11
  • This code is so short that its execution will overlap with the surrounding code, and even optimize into surrounding code at compile time, depending on what else that code does with the same vars. That part doesn't have a simple answer, see my comments on other answers and on the question. Calling it as a non-inline function with the same input repeatedly will measure its throughput but not its latency. The overall throughput bottleneck in your case includes call/ret overhead, which is non-negligible compared to how short this is. – Peter Cordes Aug 05 '20 at 18:14
  • Other than making the function unable to inline, you can also use inline asm to implement a `doNotOptimize` function, e.g. [Avoid optimizing away variable with inline asm](https://stackoverflow.com/q/44562871). See also [Idiomatic way of performance evaluation?](https://stackoverflow.com/q/60291987) for more about benchmark methodology in general, e.g. a warm-up loop. (Or if timing the whole process, run something else first to get the CPU up to speed.) – Peter Cordes Aug 05 '20 at 18:26
  • @PeterCordes yes, the call itself is an important part; I preferred to leave it in because the OP gave a full function rather than a sequence of statements. Also `while (n--) r +=` is not negligible; it is possible to measure the time with the same *main* but a third definition of *square* that just returns its argument, for instance, but beyond that there are also 'external' factors, for instance the cache, etc. I really hope the OP reads your remarks – bruno Aug 05 '20 at 18:38
  • Right, yes, since you're measuring throughput not latency, loop overhead in the caller potentially matters, depending on whether those uops can get issued for free in front-end slots that would have been wasted without them. (Depends on loop buffer and/or uop-cache fetch details). Sure you could measure a call to a trivial `square`, but subtracting times doesn't necessarily tell you a "true cost" for the added work in other contexts; that's not how out-of-order execution works. Code this short doesn't have a cost you can reduce to a single number. – Peter Cordes Aug 05 '20 at 18:55
  • Since the compiler doesn't know `square` is a "pure" function of its args, you could have just used `square(v)` (that can't be optimized away without LTO). Letting the compiler unroll the loop in the caller would remove more loop overhead, leaving more of the total overhead being the call/ret itself. – Peter Cordes Aug 05 '20 at 18:56
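
A minimal sketch of the ternary form mentioned in the if-conversion comment above; whether a given gcc version actually turns it into cmov depends on version and flags, so check the generated asm (gcc -O2 -S, or godbolt) rather than assuming:

/* Ternary form of the same function; compilers if-convert this into
   a branchless cmov most reliably (per the comment above). */
int square(int num) {
    int result = (num > 10) ? num : 0;
    return result * result;
}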