Execution time of recursive Fibonacci function is slower in C than equivalent Nim and Pascal code

Question

At this link there is a benchmark about a recursive Fibonacci function written in various languages. I tried some examples(specifically Nim and Pascal) and verified that the execution time was about the same as on that page. Then I tried the C version below (which I was interested in, since I am learning C and C++) and I was baffled to see that on my machine (an old i5 desktop) the time was ~17 seconds whereas on the benchmark page is ~7 seconds!! I used the same compiler flags as stated on that page (namely -O3 and release target of course) but got much slower execution time. I am on Microsoft Windows and I am using Mingw64 with code::blocks.

I cannot believe that Nim and Pascal are faster. After all, if I am not mistaken, Nim compiles to C!

So what am I doing wrong here?

The code is:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <time.h>

static uint64_t fib(uint64_t n) {
    if (n <= 1) return n;
    return fib(n - 1) + fib(n - 2);
}

int main() {
    time_t t1 = time(NULL);
    uint64_t res = fib(47);
    time_t t2 = time(NULL);
    printf("%llu seconds\n", t2 - t1);
    return 0;
}

Mine is an old i5 desktop with 20gb ram. The problem is not the pc. As i said i tried the Nim example and it took ~7.40 seconds to execute. I tried also the pascal one and it took ~8 seconds... — amateur, Jul 04 '23 at 10:13
I also tried the c++ example and got the same result as the C one. Maybe i'm doing something wrong... — amateur, Jul 04 '23 at 10:17
Why did i get a -1 for asking a simple question? Then it's true what they say, that stackoverflow is full of acid unwelcoming nerds!! Ahahaha — amateur, Jul 04 '23 at 10:20
There's nothing wrong with the code, except it should be %lu not %llu. Welcome to SO!! — Simon Goater, Jul 04 '23 at 10:33
Simon, so you mean 16 seconds is normal? I cannot accept it, C MUST be faster than everything else, forever!!! There must be something else at work here. As i said, Nim compiles to C, so it can't be faster than than C itself. If someone knows, please let me know. Otherwise i'll contact the person who did the benchmark and ask how did he achieve those results. Anyway thank you. — amateur, Jul 04 '23 at 10:45
This program is a classic example of the pitfalls of using recursion. This function solves a simple problem in a ridiculously inefficient way. Other compilers must be better at optimising this ridiculously inefficient code. Check out the C or assembler the other compilers produce if you're interested in exploring this. I would have expected C to at least be on par with other languages too but I'm not too concerned that this case doesn't perform well relative to other languages. — Simon Goater, Jul 04 '23 at 10:51
On my personal, not particularly powerful desktop, the code presented consistently runs in about 7 seconds. Of course, that doesn't really speak to how fast it should run on a dissimilar system, but I am inclined to guess that the issue is not with the program itself. — John Bollinger, Jul 04 '23 at 12:10
If you're only a Skylake-family CPU, did you use [How can I mitigate the impact of the Intel jcc erratum on gcc?](https://stackoverflow.com/q/61256646) ? For me, that makes the difference between 4.70 sec at 3.9GHz (18.3 Gcycles at 3.27 IPC) vs. 3.83 sec at 3.9GHz (14.9 Gcycles at 4.04 IPC). I'm on Linux so I could use `perf stat --all-user -e... ./a.out` to detect high counts for `idq.mite_uops` (legacy decode, about half of the total `uops_issued.any`) so I could see it was running into a slowdown from that. Depending on alignment differences, the effect could perhaps be even worse. — Peter Cordes, Jul 04 '23 at 12:38
(I used GCC 13.1.1 `-O3 -march=native -Wa,-mbranches-within-32B-boundaries`, on i7-6700k Skylake at 3.9GHz while it was running this. "Old i5" is not very informative. The `-march=native` didn't do anything useful; timing about the same, and the asm for `fib.isra.0` didn't use any new instructions.) — Peter Cordes, Jul 04 '23 at 12:44
What C does the Nim compiler produce? Is it compiled by the same C compiler you're using? — Peter Cordes, Jul 04 '23 at 12:50
Which compiler (and version) did you use for Nim, and which one did you use for C? Was the same compiler used for both? Have you checked what assembly they actually produced with `objdump` or [godbolt](https://godbolt.org/) ? — Shark, Jul 04 '23 at 13:03
About *"Nim compiles to C, so it can't be faster than than C itself"*, it depends on *what* C code generates the Nim compiler, I'd strongly doubt that it's as simple as the C snippet provided. — Bob__, Jul 04 '23 at 13:24
@Peter Cordes Yes it's exactly the same compiler i'm using for C, gcc 8.1.0. As far as i know, Nim first compiles the source code to C and then uses gcc compiler to compile to machine code — amateur, Jul 04 '23 at 13:34
@shark I used the exact same compiler, gcc 8.1.0. Thats' why i find it strange. First i compiled the Nim version without the flag -d:release and i had to stop the program because it was taking forever to execute. Then after i applied the -d:release flag it took ~8 secods or 7.4 something, in line with the result on the benchmark page. How can a transpiler produce more optimized code than C itself? I don't know — amateur, Jul 04 '23 at 13:38
Yes, but *what* C source code does the Nim compiler generate? Maybe it's different somehow from the C you wrote by hand, in some important or trivial way that leads to a difference in the final asm. You still haven't said what exact CPU model you have, either. "Old i5" could be anything from Nehalem (2008) to maybe Skylake (2015). Different CPU models have different performance quirks that could be a factor in such a small amount of code. — Peter Cordes, Jul 04 '23 at 15:51
It's not necessarily about "more optimized" C or not, could just be any difference. Although doubly-recursive Fibonacci is the absolute worst way to implement it. But GCC does transform one of the recursions into iteration at `-O3`. I didn't look in detail at exactly what asm it makes. If Nim did anything like this in the C before feeding it to GCC, it might end up with quite different asm. — Peter Cordes, Jul 04 '23 at 15:53
@PeterCordes For the C code Nim produces, see my comment under the answer below, — Chris Arndt, Jul 04 '23 at 16:09
@amateur Your execution environment obviously differs from the one that produced the results in the readme of the repo you linked. When I run their docker image, it states the GCC version used as 9.3.0, while you used 8.1.0. Apart from that you obviously have a different CPU, we can only guess what else is different on your system. Consistent with other commenters here, I get similar run times (around ~5s) for C/C++ as for most of the other static languages when using the docker image. So the root cause seems to lie in the specifics of your environment. — Chris Arndt, Jul 04 '23 at 16:17
FYI, the Microsoft compiler benchmarks this at 0 (that's zero) seconds. The call to `fib` is simply optimised out because the result is unused and there are no side effects. [Demo](https://godbolt.org/z/drs46e1jM). clang does the same thing, and only gcc lags behind. So much for this "benchmark". — n. m. could be an AI, Jul 23 '23 at 12:35

chqrlie · Answer 1 · 2023-07-04T11:01:19.443

You are not doing anything wrong, your system happens to have a slower CPU than the benchmark machine. Running the examples in other languages on your PC would probably produce even longer times. The same code runs in about 8 seconds on my M2 Macbook.

This naive recursive Fibonacci implementation is a classic to measure function call performance. As you can see in the README file, performance is very similar across statically typed compiled languages. None of the compilers tested match the pattern in the fib function to produce code for an alternative algorithm that would run in less than a microsecond.

Here is a modified version that outputs the result in addition to a more precise timing and accepts command line arguments:

#include <stdio.h>
#include <stdlib.h>
#include <stdint.h>
#include <inttypes.h>
#include <time.h>

static uint64_t fib(uint64_t n) {
    if (n <= 1) return n;
    return fib(n - 1) + fib(n - 2);
}

static double time_microseconds(void) {
#ifdef TIME_UTC
    struct timespec ts;
    timespec_get(&ts, TIME_UTC);
    return ts.tv_sec * 1000000.0 + ts.tv_nsec / 1000;
#elif 1
    return clock() * 1000000.0 / CLOCKS_PER_SEC;
#else
    return time(NULL) * 1000000.0;
#endif
}

static void test_fib(uint64_t n) {
    double t1 = time_microseconds();
    uint64_t res = fib(n);
    double t2 = time_microseconds();
    printf("fib(%"PRIu64") = %"PRIu64"  %g seconds\n", n, res, (t2 - t1) / 1000000.0);
}

int main(int argc, char *argv[]) {
    if (argc > 1) {
        for (int i = 1; i < argc; i++)
            test_fib(strtoul(argv[i], NULL, 0));
    } else {
        test_fib(47);
    }
    return 0;
}

Output on my M2 laptop:

chqrlie@macbook ~/dev/stackoverflow > ./230704-fib 10 20 30 {40..47}
fib(10) = 55  1e-06 seconds
fib(20) = 6765  4.5e-05 seconds
fib(30) = 832040  0.00471 seconds
fib(40) = 102334155  0.29385 seconds
fib(41) = 165580141  0.453271 seconds
fib(42) = 267914296  0.728588 seconds
fib(43) = 433494437  1.18065 seconds
fib(44) = 701408733  1.91074 seconds
fib(45) = 1134903170  3.08946 seconds
fib(46) = 1836311903  5.02768 seconds
fib(47) = 2971215073  8.17137 seconds

The faster version below produces the same results in less than a microsecond for all values tested, but so would it if implemented this way in other languages:

static uint64_t fib_fast(uint64_t n) {
    uint64_t a = 0, b = 1, c = n;
    while (n --> 1) {
        c = a + b;
        a = b;
        b = c;
    }
    return c;
}

Maybe i didn't explain myself well(english is not my first tongue). I said that i tried the Nim and Pascal versions of the function and the execution time was in line with that of the benchmark (~8 seconds). But C and C++ are much slower. Why is that? Furthemore the Nim compiler is just a C transpiler. It first compiles to c code and then uses the same compiler i'm using for C to compile to machine code. Am i clear? But while nim result was ~8 seconds C result was ~16 seconds. Why? — amateur, Jul 04 '23 at 11:52
@amateur: interesting! can you get Nim to produce the C source code and post it at the end of the question? Also did you try `-O3` or `-O4` ? — chqrlie, Jul 04 '23 at 13:16
I don't think so. You don't see the C code. It spits out the exe and that's all — amateur, Jul 04 '23 at 13:50
You can use `nim c -d:release --genScript:on --nimcache=nimcache fib.nim` to let nim create the source code in the sub-directory `nimcache`, plus a script to compile it. — Chris Arndt, Jul 04 '23 at 14:01
@Chris i did as you suggested and a folder called nimcache was created with a lot of c files in it. I cannot post them — amateur, Jul 04 '23 at 14:37
The code from `fib.nim` ends up in `nimcache/@mfib.nim.c`. I pasted it here: https://cpaste.org/?1d14b6e1a3bcedd5#9nrPEo2F7c9WnjuAXEPyoC49yZq6wwLonkAN9cd6g7tP The gcc options used by the `nimcache/compile_fib.sh` script are `-w -fmax-errors=3 -O3 -fno-strict-aliasing -fno-ident`, — Chris Arndt, Jul 04 '23 at 14:41
The nim code translates to a C function that uses `goto` statements but basically performs the same steps as the C version. On my system, the benchmark shows insignificant differences in the timings. If you get a factor of 2 on your system, there must be something fishy going on. — chqrlie, Jul 04 '23 at 20:28

score 1 · Answer 2 · answered Jul 23 '23 at 12:17

I cannot believe that Nim and Pascal are faster. After all, if I am not mistaken, Nim compiles to C!

Regarding this specific question, you can find the explanation in the issues area of drujensen's github repository: The fastest languages are cheating #119

Execution time of recursive Fibonacci function is slower in C than equivalent Nim and Pascal code

2 Answers2