
This is the code that I tested:

#include <iostream>
#include <chrono>
using namespace std;

#define CHRONO_NOW                  chrono::high_resolution_clock::now()
#define CHRONO_DURATION(first,last) chrono::duration_cast<chrono::duration<double>>(last-first).count()

int fib(int n) {
    if (n<2) return n;
    return fib(n-1) + fib(n-2);
}

int main() {
    auto t0 = CHRONO_NOW;
    cout << fib(45) << endl;
    cout << CHRONO_DURATION(t0, CHRONO_NOW) << endl;
    return 0;
}

Of course, there are much faster ways of calculating Fibonacci numbers, but this is a good little stress test that focuses on recursive function calls. There's nothing else to the code, other than the use of chrono for measuring time.
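
For reference, one of those faster ways is a simple iterative loop, which runs in linear rather than exponential time. This is just an illustrative sketch (fib_iter is a made-up name, not part of the benchmark):

int fib_iter(int n) {
    int a = 0, b = 1;              // fib(0) and fib(1)
    for (int i = 0; i < n; ++i) {  // advance the pair n steps
        int next = a + b;
        a = b;
        b = next;
    }
    return a;                      // a now holds fib(n)
}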

First I ran the test a couple of times in Xcode on OS X (so that's clang), using -O3 optimization. It took about 9 seconds to run.

Then I compiled the same code with gcc (g++) on Ubuntu (again with -O3), and that version took only about 6.3 seconds to run! Note that I was running Ubuntu inside VirtualBox on my Mac, which could only hurt performance, if it had any effect at all.

So there you go:

  • Clang on OS X -> ~9 secs
  • gcc on Ubuntu in VirtualBox -> ~6.3 secs.

I know that these are completely different compilers that do things differently, but every gcc-vs-clang benchmark I've seen showed a much smaller difference, and in some cases the difference went the other way (clang being faster).

So is there any logical explanation why gcc beats clang by miles in this particular example?

asked by notadam, edited by phuclv
  • Did you check the assembly output? And what's the version of Clang and gcc? – phuclv Mar 21 '15 at 18:11
  • 2
    Don't use these defines. That's what the `using` directive is for: `using chrono_time_point = chrono::high_resolution_clock::time_point;` – Cubic Mar 21 '15 at 18:11
  • 3
    look at the code generated, it's a simple function – Richard Critten Mar 21 '15 at 18:11
  • @LưuVĩnhPhúc gcc is 4.9.1, and clang is whatever version the latest Xcode includes, I didn't check. How do I check the assembly output? – notadam Mar 21 '15 at 18:14
  • @Cubic What's the reason behind that? – notadam Mar 21 '15 at 18:15
  • 1
    Try `constexpr int fibonacci(int n) { return n<2 ? n : fibonacci(n-1)+fibonacci(n-2); }` – geometrian Mar 21 '15 at 18:19
  • The reason is that you want to avoid 'leaving the language' as much as possible. But that's exactly what you're doing with preprocessor directives. One obvious consequence is that error messages become less useful (giving you the preprocessor resolved name instead of your alias). – Cubic Mar 21 '15 at 18:20
  • @imallett It's about the same. – notadam Mar 21 '15 at 18:26
  • 1
    this is silly. the answer is : "because they are different compilers, because different optimizations are enabled". – UmNyobe Mar 21 '15 at 18:29
  • 1
    Aah, I get it now. I tried both with no optimization flags, and the results were very close this time. The clang version took ~9.5 secs, and gcc took ~10.5. So I guess -O3 makes gcc more likely to inline function calls, while clang with -O3 doesn't do that in this case. – notadam Mar 21 '15 at 18:33
  • 4
    The answer to this question is likely to ephemeral, as both compilers will continue to evolve and improve. – Ira Baxter Mar 21 '15 at 18:51
  • To get the assembly output: [Using GCC to produce readable assembly?](http://stackoverflow.com/q/1289881/995714), [How do you get assembler output from C/C++ source in gcc?](http://stackoverflow.com/q/137038/995714), [How to generate assembly code with clang in Intel syntax?](http://stackoverflow.com/q/10990018/995714) – phuclv Jul 26 '15 at 11:34
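
Following up on Cubic's comment above, here is one way the timing code might look with plain type aliases instead of the macros. This is just a sketch of the same benchmark, not what the question actually ran:

#include <iostream>
#include <chrono>
using namespace std;

int fib(int n) {
    if (n < 2) return n;
    return fib(n-1) + fib(n-2);
}

int main() {
    using clk = chrono::high_resolution_clock;           // alias instead of CHRONO_NOW
    auto t0 = clk::now();
    cout << fib(45) << endl;
    chrono::duration<double> elapsed = clk::now() - t0;  // seconds, as a double
    cout << elapsed.count() << endl;                     // same output as CHRONO_DURATION
    return 0;
}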

2 Answers


GCC 4.9.2 on Compiler Explorer really does perform loop unrolling and inlines a lot of the function calls, while Clang 3.5.1 simply calls fib twice in each invocation, without even tail-call optimization, as shown below:

fib(int):                                # @fib(int)
        push    rbp
        push    rbx
        push    rax
        mov     ebx, edi
        cmp     ebx, 2
        jge     .LBB0_1
        mov     eax, ebx
        jmp     .LBB0_3
.LBB0_1:
        lea     edi, dword ptr [rbx - 1]
        call    fib(int)       # fib(ebx - 1)
        mov     ebp, eax
        add     ebx, -2
        mov     edi, ebx
        call    fib(int)       # fib(ebx - 2)
        add     eax, ebp
.LBB0_3:
        add     rsp, 8
        pop     rbx
        pop     rbp
        ret

The GCC version is more than 10 times longer, with only a single call to fib and 20+ labels from the inlined calls. That also means the last call has been tail-optimized into a jmp, or that GCC has converted some of the recursion into iteration (it allocates a big array for storing intermediate values).
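
To get a feel for what that inlining means at the source level, here is a sketch of fib with just one level of inlining applied by hand (fib1 is a hypothetical name, and this is purely illustrative; GCC actually goes many levels deeper and eliminates the tail call as well):

int fib1(int n) {
    if (n < 2) return n;
    int a = (n - 1 < 2) ? n - 1 : fib1(n - 2) + fib1(n - 3);  // inlined body of fib(n-1)
    int b = (n - 2 < 2) ? n - 2 : fib1(n - 3) + fib1(n - 4);  // inlined body of fib(n-2)
    return a + b;
}

Each call now does roughly twice the work, so the number of calls (and the call overhead) is cut accordingly.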

I also brought ICC into the comparison, and surprisingly it has 10 call instructions inside fib and inlines fib 9 times inside main, but it doesn't convert the recursive code to iterative.

Here are the compiler outputs for comparison.

Note that you can modify the code like this to make the output easier to read:

int fib(int n) {
    if (n<2) return n;
    int t = fib(n-1);
    return t + fib(n-2);
}

Now Compiler Explorer will highlight, in distinct colors, which source line each instruction in the assembly output corresponds to, and you'll easily see how the two calls are made. For example, the line return t + fib(n-2) is compiled by GCC to

.L3:
        mov     eax, DWORD PTR [rsp+112]  # n, %sfp
        add     edx, DWORD PTR [rsp+64]   # D.35656, %sfp
        add     DWORD PTR [rsp+60], edx   # %sfp, D.35656
        sub     DWORD PTR [rsp+104], 2    # %sfp,
answered by phuclv, edited by Schemetrical

I wouldn't say that gcc beats clang by miles. In my opinion, the performance difference (6.3 seconds vs 9 seconds) is rather small. On my FreeBSD system, clang requires 26.12 seconds and gcc requires 10.55 seconds.

However, the way to investigate this is to use g++ -S and clang++ -S (with the same -O3 flag) and compare the assembly output.

I tested this on my FreeBSD system. The assembly files are too long to post here, but it appears that gcc performs multiple levels of inlining in the Fibonacci calculation function (there were 20 fib() calls in there!), whereas clang simply calls fib(n-1) and fib(n-2) without any inlining.

By the way, my gcc version was 4.2.1 20070831 patched [FreeBSD] and my clang version was 3.1 (branches/release_31 156863) 20120523. These are the versions that ship with the FreeBSD 9.1-RELEASE base system. The CPU is an AMD Turion II Neo N40L dual-core processor (1497.54 MHz).

answered by juhist, edited by phuclv