
I came across this question, which compared the performance of various compilers on computing Fibonacci numbers the naive way.

I tried doing this with Haskell to see how it compares to C.

C code:

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  return fib (n-1) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Result:

> gcc -O3 main.c -o fib
> time ./fib 40
165580141
real    0m0.421s
user    0m0.420s
sys 0m0.000s

Haskell:

module Main where
import System.Environment (getArgs)

fib :: Int -> Int
fib n | n < 2 = 1
      | otherwise = fib (n-1) + fib (n-2)

main = getArgs >>= print . fib . read . head

Result:

> ghc -O3 -fllvm -optlo-O3 Main.hs -o fib
> time ./fib 40
165580141
real    0m1.476s
user    0m1.476s
sys 0m0.000s

Profiling with

> ghc -O3 -fllvm -optlo-O3 -prof -auto-all -caf-all -rtsopts Main.hs -fforce-recomp -o fib
> ./fib 40 +RTS -prof

reveals that `fib` takes 100% of time and allocation, no surprise. I also took some heap profiles, but I don't know what they imply:

> ./fib 40 +RTS -hc

(heap profile image: break-down by cost centre)

> ./fib 40 +RTS -hd

(heap profile image: break-down by closure description)

So my question: is there anything I can do from my side to bring this Haskell program's performance closer to C's, or is this just how GHC does things, which happens to make it slower in this micro-benchmark? (I'm not asking for an asymptotically faster algorithm to compute fibs.)

Thank you very much.

[EDIT]

It turned out that `ghc -O3` was faster than `ghc -O3 -fllvm -optlo-O3` in this case. But `-optlo-block-placement` made an observable difference for the LLVM backend:

> ghc -O3 Main.hs -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.283s
user    0m1.284s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.449s
user    0m1.448s
sys 0m0.000s

> ghc -O3 -fllvm -optlo-O3 -optlo-block-placement -o fib -fforce-recomp
> time ./fib 40
165580141
real    0m1.112s
user    0m1.096s
sys 0m0.016s

The reason I wanted to investigate this was that both C and OCaml were significantly faster than Haskell for this program. I sort of couldn't accept that, and wanted to learn more to make sure I had already done everything I could :D

> ocamlopt main.ml -o fib
> time ./fib 40
165580141
real    0m0.668s
user    0m0.660s
sys 0m0.008s
Phil
  • `gcc -O3` probably turns it into a non-recursive loop. I don't know if GHC can do the same. –  Jul 16 '11 at 10:13
  • Oh really? Fib is not tail recursive. It's not even linearly recursive. How could it be turned into a loop? – Phil Jul 16 '11 at 17:59
  • @Po: See http://stackoverflow.com/questions/405770/why-are-compilers-so-stupid/414774#414774 for an example doing it with a (not tail-recursive either) factorial, same principle. You wouldn't be the first to underestimate C compilers. You can make gcc output assembly with `-S` to check if it does optimize it (I'd check myself, but I don't have easy access to a Unix machine ATM). –  Jul 16 '11 at 18:03
  • @delnan: I see. That behavior with `-O3` is pretty impressive for an imperative language compiler. But back to this program, how can this `fib` function possibly be made non-recursive? You have 2 recursive calls, and the final pending addition is inevitable. And the original author took care to give the argument at runtime so compilers couldn't perform constant propagation like that in `gcc -O3` from the link you gave. – Phil Jul 16 '11 at 18:25
  • @Po: The constant has nothing to do with the removal of recursion (go ahead, try it yourself - or just consider the fact that gcc didn't even bother to propagate it on `-O2` but still generated the loop). And while the two recursive calls seem to make it harder, I'd just look and see instead of hypothesizing. Again, you wouldn't be the first to underestimate the compiler... –  Jul 16 '11 at 18:57

4 Answers


The heap profile is not very interesting here, as GHC compiles `fib` into a function that operates solely on the stack. Just look at the profile: only 800 bytes are ever allocated, the small overhead of your `main` implementation.

As far as GHC's Core level is concerned, this actually gets optimized as far as it can get. The low-level code generation is another matter, though. Let's have a quick dive into the code GHC generates:

_Main_zdwfib_info:
.Lc1RK:
    leal -8(%ebp),%eax
    cmpl 84(%ebx),%eax
    jb .Lc1RM

This is the check for stack space, probably something C doesn't need, as it can let the operating system handle stack allocation. Haskell has user-level threads, so stack space is managed manually.

    cmpl $2,0(%ebp)
    jl .Lc1RO

The comparison against 2 from your code.

    movl 0(%ebp),%eax
    decl %eax

The parameter is reloaded from the stack and decremented to get the argument for the recursive call. The reload is probably unnecessary, though I'm not sure it makes a difference.

    movl %eax,-8(%ebp)
    movl $_s1Qp_info,-4(%ebp)
    addl $-8,%ebp
    jmp _Main_zdwfib_info

The parameter and the return address are pushed on top of the stack, and we jump directly to the label in order to recurse.

.Lc1RO:
    movl $1,%esi
    addl $4,%ebp
    jmp *0(%ebp)

The code for the case that the parameter is less than 2. The return value is passed in a register.

Bottom line: everything seems to be working as it should, and it's unlikely you will be able to squeeze much more out of this by changing the program. The custom stack check is an obvious source of slowdown, though I'm not sure whether it can be blamed for the full time difference.

Peter Wortmann
  • How can this stack check be eliminated? – is7s Jul 20 '11 at 12:57
  • 2
    @is7s: It can not be eliminated - it is an integral part of how programs compiled with GHC work. Note that most programs wouldn't have to do this stack check often, and the custom stack allows GHC to scale up to millions of user threads. – Peter Wortmann Jul 20 '11 at 19:01

This seems like a really feeble 'benchmark', as barsoap says. Suppose I compare the following almost equally naive programs:

module Main where
import System.Environment (getArgs)

fib ::  Int ->  Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = (\x y -> x + y + y) (fib (n-3)) (fib (n-2))

main = getArgs >>= print . fib . read . head  

... and in the other corner ...

#include <stdio.h>
#include <stdlib.h>

int fib (int n) {
  if (n < 2) return 1;
  if (n < 3) return n;
  return (fib (n-3) + fib (n-2)) + fib (n-2);
}

int main (int argc, char* argv[]) {
  printf ("%i\n", fib (atoi(argv[1])));
  return 0;
}

Then the Glorious ghc crushes gcc, not too surprisingly, really:

$ ghc --make -fforce-recomp fib.hs -o fibh
[1 of 1] Compiling Main             ( fib.hs, fib.o )
Linking fibh ...
$ time ./fibh 40
165580141

real    0m0.013s
user    0m0.007s
sys 0m0.003s

$ gcc fib.c -o fibc
$ time ./fibc 40
165580141

real    0m1.495s
user    0m1.483s
sys 0m0.003s

Now, turning on optimizations, ghc picks up a little more speed:

$ ghc --make -fforce-recomp fib.hs -O3 -o fibhO
$ time ./fibhO 40
165580141

real    0m0.007s
user    0m0.002s
sys 0m0.004s

But gcc finally gets a clue:

$ gcc fib.c -O3 -o fibcO
$ time ./fibcO 40
165580141

real    0m0.007s
user    0m0.004s
sys 0m0.002s

I think the explanation is GHC's wariness of common subexpression elimination: CSE is dangerous in a language where '(almost) everything is an expression', so it figures the programmer knows how to use a lambda to share a value explicitly.
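To see the sharing without the lambda: the same effect can be written with a plain `let`. This is a sketch of the idea, not code from the answer:

```haskell
-- Sketch: the repeated fib (n-2) is bound once and used twice, which is
-- exactly what the answer's lambda achieves with its y parameter.
fib :: Int -> Int
fib 0 = 1
fib 1 = 1
fib 2 = 2
fib n = let y = fib (n - 2)   -- computed once, shared
        in  fib (n - 3) + y + y
```

Either spelling makes the sharing explicit, instead of hoping the compiler performs CSE on two syntactically identical calls.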

applicative
  • Is this performance difference due to the use of a lambda expression? – is7s Jul 22 '11 at 20:44
  • I was making two emendations at once, making things less clear than they might be: first, I exposed that `fib n = fib (n-3) + fib (n-2) + fib (n-2)` (and thus supplied the case fib 3) -- as in the C file. With `ghc -O2` this is as fast as `gcc -O3`. Without optimization flags, `ghc` and `gcc` are both slow. If (second emendation) I add the lambda, then `ghc` *without* optimization flags is already fast. My thought was that it was basically this lambda that `ghc -O2` and `gcc -O3` see. That `gcc` can see this even in `po`'s file is a tiny bit of additional cleverness. – applicative Jul 22 '11 at 22:07
  • Another way to put it is that, with the use of the lambda, even using `runhugs` is faster than `gcc` compiling without optimization flags. – applicative Jul 23 '11 at 02:13
  • But I don't get why or how does lambda addition result in this improvement. – is7s Jul 23 '11 at 07:29
  • @is7s: Because each sub-expression `(fib n)` is written only once and bound to either the `x` or `y` parameter. Each of these expressions then gets (lazily) evaluated only once; the second use of `y` naturally reuses the already-evaluated value. – Phil Jul 23 '11 at 17:39
  • @Po but this doesn't make any difference in the original naive fibonacci algorithm in your question, does it? – is7s Jul 23 '11 at 21:08
  • In `po`'s original clause `otherwise = fib (n-1) + fib (n-2)`, nothing is (visibly) repeated, so there is no use for a lambda. If we rewrite `fib (n-1)` as `fib (n-2) + fib (n-3)`, which it is (for n > 3), then we get `otherwise = fib (n-2) + fib (n-3) + fib (n-2)`. Then we have a common subexpression `fib (n-2)` and can write `otherwise = fib2 + fib3 + fib2 where fib2 = fib (n-2); fib3 = fib (n-3)`, or the explicit lambda I used. (Or we can write `2 * fib2 + fib3`, of course.) To do any of this you need to add a third base case, `fib 2 = 2`. – applicative Jul 25 '11 at 00:34

GHC compiles this fine. The next step is to micro-optimize the output from GHC's backend. Playing with various LLVM flags can help here.

To do this, use ghc-core to inspect the assembly, and try other flags to LLVM to see what you get.

Another approach would be to add a small amount of parallelism.
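One way to sketch that suggestion (an illustration, not Don's code): spark the two recursive calls with `par`/`pseq` from `GHC.Conc`, which ships with base. The name `pfib` and the cutoff of 25 are assumptions for illustration; below the cutoff it falls back to the sequential version, since sparking tiny computations costs more than it saves.

```haskell
import GHC.Conc (par, pseq)

-- Sequential version, as in the question.
fib :: Int -> Int
fib n | n < 2     = 1
      | otherwise = fib (n-1) + fib (n-2)

-- Parallel wrapper: evaluate x in a spark while this thread computes y.
pfib :: Int -> Int
pfib n
  | n < 25    = fib n            -- cutoff: sequential for small inputs
  | otherwise = x `par` (y `pseq` (x + y))
  where
    x = pfib (n - 1)
    y = pfib (n - 2)
```

Compile with `ghc -O2 -threaded` and run with `+RTS -N` to use several cores; for a micro-benchmark this small, any speedup depends heavily on the cutoff.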

Don Stewart
  • 1
    Thanks Don :D. I played around with LLVM flags a bit. It looks like only `optlo-block-placement` made an observable difference (from ~1.5s to ~1.1s). Btw, I didn't notice that GHC's default backend (simply `ghc -O3`) was faster than `ghc -O3 -fllvm -optlo-O3` (~1.2s, ~1.5s respectively) for this program on my machine. – Phil Jul 16 '11 at 18:03
  • You can also do more advanced llvm games: see http://donsbot.wordpress.com/2010/03/01/evolving-faster-haskell-programs-now-with-llvm/ – Don Stewart Jul 17 '11 at 02:36

Try this:

import System.Environment (getArgs)

fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fib :: Int -> Integer
fib n = fibs !! n

main = getArgs >>= print . fib . read . head

$ time ./fib 10000
  33644[...]6875
  ./fib 10000  0.03s user 0.01s system 89% cpu 0.039 total

(That's on a good ole Athlon64 3200+.)

The version you are using is, for every n, recomputing fib (n-1) and fib (n-2), that is, it makes an exponential number of calls. The version above is linear: each fib is only computed once. Despite what the non-Haskell programming hivemind seems to think, Haskell does not automatically memoise (which would, in general, be slower than good ole dynamic programming anyway).

There's an even faster version of Fibonacci (using maths tricks) on the Haskell Wiki.

Change the C version to be non-recursive, and my bet is that you'll see both Haskell and C having very similar performance. Tight loops are just easier to optimise.
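For reference, here is what a non-recursive, accumulator-style version looks like on the Haskell side. A sketch (the name `fibIter` is mine), keeping the question's 1, 1, 2, 3, ... convention:

```haskell
-- Sketch: iterative fib with two accumulators, the kind of tight loop
-- both compilers optimise well. seq keeps the running sum evaluated so
-- no thunk chain builds up, without needing the BangPatterns extension.
fibIter :: Int -> Int
fibIter n = go n 1 1
  where
    -- a is fib of the current index, b the next one
    go :: Int -> Int -> Int -> Int
    go 0 a _ = a
    go k a b = let s = a + b in s `seq` go (k - 1) b s
```

Note that `Int` overflows well before `fib 10000`; the `Integer`-based list version above is the one to use for large indices.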

barsoap
  • From the question: *"compared the performance of various compilers on computing Fibonacci numbers **the naive way**"*, also *"I'm not asking for an asymptotically faster algorithm to compute fibs"* – ShreevatsaR Jul 16 '11 at 11:14
  • ...I understand the question, it just doesn't fit into my mind why one would want to compare naive implementations of micro-benchmarks. Comparing naive C IO and naive stream-fusion enabled Haskell IO, that I would understand. Also, that zipWith *is* naive, depending on which math book you have in your hand. – barsoap Jul 16 '11 at 11:19
  • The benchmark is to compare implementations of a certain recursive algorithm; the point is not to find the Fibonacci numbers. – ShreevatsaR Jul 16 '11 at 11:25
  • Thanks barsoap :). The reason I wanted to investigate further was because both C and OCaml were significantly faster than Haskell in this test (0.4s, 0.6s, and 1.4s on my machine, respectively), and I found it sort of hard to accept, so I wanted to learn more to make sure I already did everything I could. – Phil Jul 16 '11 at 17:57
  • @barsoap The naive recursive version tests the implementation's ability to handle a large number of small function calls, something that an FP language implementation should be rather good at. – stonemetal Jul 16 '11 at 19:21
  • @stonemetal The thing is, GHC code usually doesn't end up with a large number of small function calls, due to very aggressive inlining. The productive work/uninlinable call ratio here is very atypical for actual programs. – barsoap Jul 17 '11 at 04:26