
Just for fun, I tried to compare the stack performance of a couple of programming languages calculating the Fibonacci series using the naive recursive algorithm. The code is essentially the same in all languages; I'll post a Java version:

public class Fib {
    public static int fib(int n) {
        if (n < 2) return 1;
        return fib(n - 1) + fib(n - 2);
    }

    public static void main(String[] args) {
        System.out.println(fib(Integer.valueOf(args[0])));
    }
}

OK, so the point is that, using this algorithm with input 40, I got these timings:

C: 2.796s
OCaml: 2.372s
Python: 106.407s
Java: 1.336s
C# (Mono): 2.956s

They were taken on an Ubuntu 10.04 box, using the versions of each language available in the official repositories, on a dual-core Intel machine.

I know that functional languages like OCaml have the slowdown that comes from treating functions as first-class citizens, and I have no problem explaining CPython's running time, since it's the only interpreted language in this test, but I was impressed by the Java running time, which is half that of C for the same algorithm! Would you attribute this to JIT compilation?

How would you explain these results?

EDIT: Thank you for the interesting replies! I recognize that this is not a proper benchmark (I never said it was :P) and maybe I can make a better one and post it next time, in light of what we've discussed :)

EDIT 2: I updated the runtime of the OCaml implementation, using the optimizing compiler ocamlopt. I also published the testbed at https://github.com/hoheinzollern/fib-test. Feel free to make additions to it if you want :)

hoheinzollern
  • Apart from the usual rules applying to benchmarks... (1) The OCaml *(native) compiler* is quite aggressive and shouldn't be six times slower than C when dealing with such an important FP concept as recursion. Did you use the bytecode interpreter? (2) What optimization settings for C? – Nov 08 '10 at 06:51
  • Did you execute multiple samples? Did you remove outliers? Did you average results? Did you measure clock time or CPU time? Have you even _heard_ of statistics? :-) – paxdiablo Nov 08 '10 at 06:52
  • What I'm surprised by is the Java run time. I've seen this before... I was doing a quicksort in C and Java, and Java outperformed C every time. – Nicholas Nov 08 '10 at 06:53
  • Haha, it was just a fun experiment; I just used the default settings. I ran each program three times and computed the mean, measuring just CPU time. Even though I recognize it's not fully scientific, I remain impressed by the Java performance :) – hoheinzollern Nov 08 '10 at 07:00
  • @paxdiablo: Statistics are those things that come after lies and damned lies, right? ;-) – James McNellis Nov 08 '10 at 07:00
  • @Nicholas: Something sounds fishy. It would be nice to see your C code and know what compiler and optimization settings you used. – R.. GitHub STOP HELPING ICE Nov 08 '10 at 07:50
  • @R.: It is fishy. Just by compiling with "-O3", the function as given (and seen as C) drops to 0.85 seconds. – Jens Gustedt Nov 08 '10 at 08:15
  • @hoheinzollern: How do you factor out the time spent doing IO? In C, `printf()` is quite an expensive function even ignoring the amount of time spent in IO wait. – JeremyP Nov 08 '10 at 09:17
  • Without *all* versions of the code and specification of compilers and compile options used, few conclusions can be drawn. Default options for most C compilers are aimed at debuggability, not performance; with a JIT-compiled language like Java, this is not an issue. What, no Forth implementation!? ;) – Clifford Nov 08 '10 at 09:46
  • @R.. No optimizations. Both were compiled with just generic command line arguments (gcc bubble.c and javac bubble.java). Other than that, in C it was a struct that was being sorted and in Java we cast them as objects, but bubble sort is bubble sort; both pieces of code implemented the same algorithm. – Nicholas Nov 08 '10 at 18:23
  • There's no such thing as "no optimizations" in general. A Java program will be optimized at runtime by the JVM/JIT. Many C compilers, on the other hand, intentionally *pessimize* when no optimization is requested, to assist in debugging. You're comparing apples to oranges. Always compare with the highest optimization level. – R.. GitHub STOP HELPING ICE Nov 08 '10 at 20:43
  • Now I'm curious as to what results an Ackermann implementation would show. – Lars Nov 08 '10 at 23:38
  • Care to show us the Python version? 100:1 seems a little excessive. And no, CPython isn't interpreted. – Russell Borogove Nov 09 '10 at 00:11
  • Also, how many of these implementations correctly compute the 93rd fibonacci number? – Russell Borogove Nov 09 '10 at 00:16

8 Answers


You might want to crank up the optimisation level of your C compiler. Compiling with gcc -O3 makes a big difference: a drop from 2.015 seconds to 0.766 seconds, a reduction of about 62%.

Beyond that, you need to ensure you've tested correctly. You should run each program ten times, remove the outliers (fastest and slowest), then average the other eight.

In addition, make sure you're measuring CPU time rather than clock time.

Anything less than that I would not consider a decent statistical analysis; it may well be subject to noise, rendering your results useless.

For what it's worth, those C timings above were for seven runs with the outliers taken out before averaging.
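
By way of illustration, a harness along these lines would do it (just a sketch, using the naive fib from the question: clock() measures CPU time rather than wall-clock time, and sorting the samples makes dropping the fastest and slowest trivial):

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

static int fib(int n) {
    return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

static int cmp(const void *a, const void *b) {
    double x = *(const double *)a, y = *(const double *)b;
    return (x > y) - (x < y);
}

int main(void) {
    enum { RUNS = 10 };
    double t[RUNS], sum = 0.0;
    volatile int sink;   /* volatile sink keeps fib(40) from being optimized away */
    int i;

    for (i = 0; i < RUNS; i++) {
        clock_t start = clock();          /* CPU time, not wall-clock time */
        sink = fib(40);
        t[i] = (double)(clock() - start) / CLOCKS_PER_SEC;
    }
    (void)sink;
    qsort(t, RUNS, sizeof t[0], cmp);     /* sort, then drop both extremes */
    for (i = 1; i < RUNS - 1; i++)
        sum += t[i];
    printf("mean of middle %d runs: %.3fs\n", RUNS - 2, sum / (RUNS - 2));
    return 0;
}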


In fact, this question shows how important algorithm selection is when aiming for high performance. Although recursive solutions are usually elegant, this one suffers from duplicating a lot of calculations. The iterative version:

int fib(unsigned int n) {
    int t, a, b;
    if (n < 2) return 1;
    a = b = 1;
    while (n-- >= 2) {
        t = a + b;
        a = b;
        b = t;
    }
    return b;
}

further drops the time taken, from 0.766 seconds to 0.078 seconds, a further reduction of 89% and a whopping reduction of 96% from the original code.


And, as a final attempt, you should try out the following, which combines a lookup table with calculations beyond a certain point:

static int fib(unsigned int n) {
    static int lookup[] = {
        1, 1, 2, 3, 5, 8, 13, 21, 34, 55, 89, 144, 233, 377,
        610, 987, 1597, 2584, 4181, 6765, 10946, 17711, 28657,
        46368, 75025, 121393, 196418, 317811, 514229, 832040,
        1346269, 2178309, 3524578, 5702887, 9227465, 14930352,
        24157817, 39088169, 63245986, 102334155, 165580141 };
    int t, a, b;

    if (n < sizeof(lookup)/sizeof(*lookup))
        return lookup[n];
    a = lookup[sizeof(lookup)/sizeof(*lookup)-2];
    b = lookup[sizeof(lookup)/sizeof(*lookup)-1];
    while (n-- >= sizeof(lookup)/sizeof(*lookup)) {
        t = a + b;
        a = b;
        b = t;
    }

    return b;
}

That reduces the time yet again but I suspect we're hitting the point of diminishing returns here.
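
For completeness, and as a commenter points out below, there is also a closed form (Binet's formula). A sketch in C, written against the question's fib(0) = fib(1) = 1 convention, so fib(n) is the standard F(n+1); it is exact here because every value that fits in a 32-bit int is well within a double's 53-bit mantissa (build with -lm):

#include <math.h>
#include <stdio.h>

static int fib(unsigned int n) {
    const double sqrt5 = sqrt(5.0);
    const double phi = (1.0 + sqrt5) / 2.0;        /* golden ratio */
    return (int)(pow(phi, n + 1) / sqrt5 + 0.5);   /* round to nearest */
}

int main(void) {
    printf("%d\n", fib(40));   /* 165580141, in constant time */
    return 0;
}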

paxdiablo
  • Ahem, the values are only outliers if they are more than `1.5*(Q3 - Q1)` away from the first or third quartile... – Rafe Kettler Nov 08 '10 at 07:12
  • Well, that's _one_ definition and probably a good one but I don't believe there's a _rigid_ definition in stats. I have no problem using a simplistic definition if the majority of data points are going to lie within a reasonable range. YMMV. – paxdiablo Nov 08 '10 at 07:21
  • You can use a different standard for an outlier if you wish, but you have to use a standard. Removing the fastest and the slowest is definitely not a standard. – Falmarri Nov 08 '10 at 08:04
  • If you are going to pick on the algorithms used rather than the benchmark, no amount of nitpicking is complete without mentioning that there is a closed form for the Fibonacci sequence. Hence there's a direct function to determine the nth number in the sequence, and this is the preferred algorithm by far. – user268396 Nov 08 '10 at 08:50
  • @user268396: Actually, given the fact that the return type is a Java int (32 bit signed), the best way is to use a lookup table for every value. See my answer to a previous question: http://stackoverflow.com/questions/3165293/fibonacci-sequence-in-c/3166102#3166102 – JeremyP Nov 08 '10 at 09:04
  • @Falmarri, if you can point me to a definitive definition of an outlier, I'll be happy to adapt my answer. The definition I've always used is basically those furthest from the average/mean/median or whatever. I don't have a _solid_ background in stats but I also don't take something as gospel just because some random guy on the net tells me so. At least not without some hard evidence to back it up. Especially if they look like Homer Simpson :-) – paxdiablo Nov 08 '10 at 09:09
  • @user268396, you're correct of course. If there's a better solution than the one I offered, we should use it. I was just trying to get across the idea that sometimes macro optimisation has the best return on investment. – paxdiablo Nov 08 '10 at 09:12
  • Btw, does "gcc -O3" perform constant propagation? I remember having read somewhere that -O3 did some purity analysis and constant propagation too. If so, this could have been benchmarking printf only? – Phil Nov 08 '10 at 11:24
  • Oops sorry I thought it was "fibo(40)" hardcoded in the benchmark. I just read again and found it being passed as a parameter – Phil Nov 08 '10 at 11:28
  • Looking at the assembly code generated, it would seem `-O3` triggers some aggressive loop unrolling that pays off. – Thanatos Nov 08 '10 at 14:39

You say very little about your configuration (in benchmarking, details are everything: command lines, computer used, ...).

When I try to reproduce for OCaml I get:

let rec f n = if n < 2 then 1 else (f (n-1)) + (f (n-2))

let () = Format.printf "%d@." (f 40)


$ ocamlopt fib.ml
$ time ./a.out 
165580141

real    0m1.643s

This is on an Intel Xeon 5150 (Core 2) at 2.66GHz. If I use the bytecode OCaml compiler ocamlc on the other hand, I get a time similar to your result (11s). But of course, for running a speed comparison, there is no reason to use the bytecode compiler, unless you want to benchmark the speed of compilation itself (ocamlc is amazing for speed of compilation).

Pascal Cuoq
  • Nice, I didn't know there were two OCaml compilers, and in fact I used the bytecode one... thanks for the explanation! – hoheinzollern Nov 08 '10 at 07:21
  • @hoheinzollern There are four! There's also `ocamlc.opt`, the bytecode-generating compiler compiled with `ocamlopt`, and `ocamlopt.opt`, the native compiler compiled with itself. These two produce the same code as their respective bytecode-compiled versions, of course. – Pascal Cuoq Nov 08 '10 at 07:49
  • @hoheinzollern - could you update your results above using ocamlopt so we can get a better comparison? – aneccodeal Nov 08 '10 at 15:23

One possibility is that the C compiler is optimizing on the guess that the first branch (n < 2) is the one more frequently taken. It has to do that purely at compile time: make a guess and stick with it.

Hotspot gets to run the code, see what actually happens more often, and reoptimize based on that data.

You may be able to see a difference by inverting the logic of the if:

public static int fib(int n) {
    if (n >= 2) return fib(n - 1) + fib(n - 2);
    return 1;
}

It's worth a try, anyway :)

As always, check the optimization settings for all platforms, too: obviously the compiler settings for C, and on Java, try using the client version of HotSpot vs the server version. (Note that you need to run for longer than a second or so to really get the full benefit of HotSpot... it might be interesting to put the outer call in a loop to get runs of a minute or so.)

Jon Skeet
  • Changing the branch order in the C code does not change the performance noticeably. There must be some other optimization that I can't figure out right now. – hoheinzollern Nov 08 '10 at 06:54
  • Also, why is the C# implementation twice as slow? (Trying with .NET I get the same results, so it's not Mono's fault.) – hoheinzollern Nov 08 '10 at 06:55
  • @hoheinzollern: No idea. Mind you, it's such a small test that it could easily be the kind of thing where the Java team made a trade-off one way, the .NET team made a trade-off the other way, and in this particular case the Java approach works well. – Jon Skeet Nov 08 '10 at 06:57
  • I just tried your optimization on a toy benchmark (outside the IDE, release mode, multiple runs, desktop versions): `C# 1.73s, Java .95s` and with your optimization: `C# 1.35s, Java .94s`. A small benchmark which probably contains flaws, but nice :) – Lasse Espeholt Nov 08 '10 at 07:03
  • @Jon - could it be the result of the CLR not interpreting the code to get traces, and therefore guessing the wrong preferred path? I read somewhere that the CLR always compiled first. – Stephen C Nov 08 '10 at 07:08
  • On my laptop the C# version on the 32-bit .Net 4 CLR takes 0.77s for the "classic" version and 0.73s for the "optimized" version (release mode, no debugger). The 64-bit version takes 0.70s regardless of the "optimization". – Gabe Nov 08 '10 at 19:58

I can explain the Python performance. Python's performance for recursion is abysmal at best, and recursion should be avoided like the plague when coding in it. Especially since, by default, a stack overflow occurs at a recursion depth of only 1000...

As for Java's performance, that's amazing. It's rare that Java beats C (even with very little compiler optimization on the C side)... what the JIT might be doing is memoization or tail recursion...
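
To illustrate what memoization would buy (a sketch of the technique in C, not a claim about what the JIT actually does), caching each result turns the exponential algorithm into a linear one:

#include <stdio.h>

/* Same naive recursion, but each fib(n) is computed only once:
   cache[n] holds the result (0 means "not computed yet"). */
static int cache[47];   /* fib(46) would overflow a 32-bit int anyway */

static int fib(int n) {
    if (n < 2) return 1;
    if (cache[n] == 0)
        cache[n] = fib(n - 1) + fib(n - 2);
    return cache[n];
}

int main(void) {
    printf("%d\n", fib(40));   /* 165580141, now in linear time */
    return 0;
}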

tzot
Rafe Kettler
  • Nitpicking: CPython's recursion limit defaults to 1000, at least on common desktop architectures. Otherwise, yeah. – Nov 08 '10 at 07:11
  • Yes, recursion isn't good for Python, especially when going this deep. I decided to try both with Psyco (a JIT compiler) and with Cython. Psyco brought the time down from 75.6 to 3.36 seconds. With Cython, I was able to get it down to 1.27. If the times are truly proportional to those in the OP, then the Cython code would run in about 1.79 seconds on his machine. That's pretty decent by my standards, but maybe the times wouldn't scale so nicely. – Justin Peel Nov 08 '10 at 07:18
  • @delnan: it varies from OS to OS and version to version, but yes, it appears on the machine I'm on now it's 1000. – Rafe Kettler Nov 08 '10 at 07:22
  • There's also the time it spends having to calculate the spaces at the beginning of each line :) – JeremyP Nov 08 '10 at 09:10
  • @JeremyP: In case you really believe that (it sounds tongue-in-cheek, but one never knows for sure...): **No**, Python is not interpreted. It compiles to bytecode before execution (and everything that gets imported has its bytecode cached/written to the disk, so in fact it compiles very rarely). –  Nov 08 '10 at 19:48
  • @delnan: Of course it was tongue in cheek. That's why I put a smiley at the end of the comment. – JeremyP Nov 08 '10 at 21:23

Note that if the Java HotSpot VM is smart enough to memoise fib() calls, it can cut down the exponential cost of the algorithm to something nearer to linear cost.

user268396
  • I'm smart enough to "memoize" `fib` calls too. I replace them with `static const int fib[] = { 1, 1, 2, 3, ... };` which is smaller than any possible implementation in code for the range of `int`. – R.. GitHub STOP HELPING ICE Nov 08 '10 at 07:53
  • But if the JVM uses a sliding window or something for memoising the last N calls to fib() it will outperform your static lookup table. – user268396 Nov 08 '10 at 08:53
  • @user268396: No. R.. only needs an array of 47 Fibonacci numbers. The 48th overflows a 32-bit int. – JeremyP Nov 08 '10 at 09:09

With C, you should either declare the Fibonacci function "inline" or, using gcc, add the -finline-functions argument to the compile options. That will allow the compiler to do recursive inlining. That's also the reason why you get better performance with -O3: it activates -finline-functions.

Edit: You need to specify at least -O/-O1 to get recursive inlining, even if the function is declared inline. Actually, testing myself, I found that declaring the function inline and using -O as the compilation flag, or just using -O -finline-functions, made my recursive Fibonacci code faster than with -O2 or -O2 -finline-functions.
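
A minimal sketch of that setup (assuming GCC; the exact speedup will vary):

#include <stdio.h>

/* "inline" lets GCC consider recursive inlining of fib into itself;
   build with, e.g.: gcc -O -finline-functions fib.c */
static inline int fib(int n) {
    return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

int main(void) {
    printf("%d\n", fib(40));
    return 0;
}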

steabert

I wrote a C version of the naive Fibonacci function and compiled it to assembler with gcc (4.3.2, Linux). I then did the same with gcc -O3.

The unoptimised version was 34 lines long and looked like a straight translation of the C code.

The optimised version was 190 lines long and (it was difficult to tell but) it appeared to inline at least the calls for values up to about 5.

JeremyP

One C trick you can try is to disable stack checking (i.e., the built-in code which makes sure the stack is large enough to permit the additional allocation of the current function's local variables). This could be dicey for a recursive function and indeed could be the reason behind the slow C times: the executing program might well have run out of stack space, forcing the stack checking to reallocate the entire stack several times during the run.

Try to approximate the stack size you need, force the linker to allocate that much stack space, then disable stack checking and re-make the program.
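
On POSIX systems, one way to experiment with a fixed, preallocated stack is to run the computation on a thread with an explicit stack size (a sketch; the 64 MiB figure is an arbitrary guess, not a measured requirement):

#include <pthread.h>
#include <stdio.h>

static int fib(int n) {
    return n < 2 ? 1 : fib(n - 1) + fib(n - 2);
}

static void *run(void *arg) {
    (void)arg;
    printf("%d\n", fib(40));
    return NULL;
}

int main(void) {
    /* Give the computation a fixed stack, allocated up front. */
    pthread_attr_t attr;
    pthread_t tid;
    pthread_attr_init(&attr);
    pthread_attr_setstacksize(&attr, 64 * 1024 * 1024);   /* 64 MiB */
    pthread_create(&tid, &attr, run, NULL);
    pthread_join(tid, NULL);
    return 0;   /* build with: gcc fib.c -pthread */
}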

Olof Forshell