
Disclaimer: some possibly dubious Haskell ahead.

I've implemented two versions of Fibonacci: one linear and tail-recursive (including a banged variant), and another using the classic lazy-list approach:

{-# LANGUAGE BangPatterns #-}

-- Linear, tail-recursive
fibLinear :: Int -> Integer
fibLinear 0 = 0
fibLinear 1 = 1
fibLinear x = fibLin' 2 1 2
  where
    -- i: current index; y: fib i; z: fib (i + 1)
    fibLin' i y z
      | i == x    = y
      | otherwise = fibLin' (i + 1) z (z + y)

-- Linear, tail-recursive, with strict (banged) arguments
fibLinearBang :: Int -> Integer
fibLinearBang 0 = 0
fibLinearBang 1 = 1
fibLinearBang x = fibLin' 2 1 2
  where
    fibLin' !i !y !z
      | i == x    = y
      | otherwise = fibLin' (i + 1) z (z + y)

-- Haskeller classic lazy list
fibClassic :: Int -> Integer
fibClassic n = fib' !! n
  where
    fib' = 0 : 1 : zipWith (+) fib' (tail fib')
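
As a quick sanity check, all three implementations agree in GHCi (with this indexing, fib 30 = 832040):

ghci> fibLinear 30
832040
ghci> fibLinearBang 30
832040
ghci> fibClassic 30
832040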

When using criterion to benchmark it (code), I get somewhat unintuitive results: for Fib 30, fibLinear takes 443.0 ns and fibLinearBang takes 279.5 ns, versus 73.51 ns for fibClassic. This is strange because I would have thought fibLinear and fibLinearBang could be optimised into something like a loop or a jump.
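
For the curious, a minimal sketch of such a criterion harness (not the project's actual benchmark code, which is linked below) could look like this, assuming the three functions above are in scope:

import Criterion.Main (bench, bgroup, defaultMain, nf)

main :: IO ()
main = defaultMain
  [ bgroup "Fib 30"
      [ bench "fibLinear"     $ nf fibLinear 30
      , bench "fibLinearBang" $ nf fibLinearBang 30
      , bench "fibClassic"    $ nf fibClassic 30
      ]
  ]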

My only guess is that fib' in fibClassic is memoised as a constant list, despite being defined inside a function, so that subsequent invocations of the function reuse the same list for lookups.
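
Concretely, I'm guessing the compiled program behaves as if fib' had been floated out to the top level, roughly like this hand-written sketch (fibs and fibClassicFloated are names I made up for illustration, not compiler output):

-- fibs becomes a top-level constant: computed once, retained, and
-- shared, so later calls only index into an already-evaluated list.
fibs :: [Integer]
fibs = 0 : 1 : zipWith (+) fibs (tail fibs)

fibClassicFloated :: Int -> Integer
fibClassicFloated n = fibs !! n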

Can someone give me a definitive explanation for my benchmark results?

For those interested, I've put the project on GitHub.

  • I'd try making all the arguments of `fibLin'` strict and benchmarking again. You can do that e.g. by writing `fibLin' (!_) (!_) (!_) | ...` in the first case and enabling the relevant `BangPatterns` extension. – chi Oct 10 '16 at 13:22
  • What's your optimization level? Tail recursion only acts like a loop, performance-wise, when it's strict, and GHC might only make your code strict at certain optimization levels. The lazy version definitely does not get memoized, and I'm not sure how much that'd even help (you wouldn't need to re-calculate the beginning of the list, but you'd still need to iterate over it, and the calculation doesn't take that much time). It's just relatively fast by default. – sepp2k Oct 10 '16 at 13:24
  • Also why does `fibLin'` produce a tuple? I don't see you ever using the second element of it. – sepp2k Oct 10 '16 at 13:24
  • @chi I agree, the (i+1) and (z+y) may be stored in a thunk, causing more work to be done to evaluate them. – Crazycolorz5 Oct 10 '16 at 13:27
  • @sepp2k the function has the first element as the 2nd to last one in the series, and the second element as the previous element. Once you get to the right number, you can just discard the second element, as it's only used in computation. – Crazycolorz5 Oct 10 '16 at 13:27
  • @Crazycolorz5 That information is passed via parameters. Unless I'm blind, there is no point in the code where the second element of the tuple is ever accessed. The recursive case just passed the tuple through untouched. – sepp2k Oct 10 '16 at 13:30
  • @sepp2k So you're right. The last recursive call isn't directly providing a tuple. – Crazycolorz5 Oct 10 '16 at 13:33
  • There is no point in strictifying `i`, because at each step it's compared with `x` and hence forced. The other arguments can indeed be big thunks; however, I wouldn't be surprised if Core contains `Int#`s rather than `Int`s, in which case bangs won't help. But you're also comparing `x` to `0` and `1` at each recursive step, which is redundant: you can do it just once. And can't full-laziness actually memoize `fib'`? Though, I think such memoization wouldn't improve performance that much or at all. – effectfully Oct 10 '16 at 13:38
  • Thanks all for the feedback; I've updated the code to (1) make `fib'` return a single number instead of a tuple (thanks @sepp2k), and (2) check i only once (thanks @user3237465). However, the benchmark results haven't significantly changed (still within the same territory when std. dev is taken into account), and more importantly, still slower than the lazy list method. I haven't tried the "BangPatterns" thing because I don't know what that entails (not enough Haskell-foo) so more pointers would be appreciated. BTW, full results are [here](https://github.com/lloydmeta/fib-hs#full-results) – lloydmeta Oct 10 '16 at 14:02
  • @sepp2k I've set optimisation level to `-O2`; [see the cabal file](https://github.com/lloydmeta/fib-hs/blob/master/fib-hs.cabal#L51-L52) for more info. – lloydmeta Oct 10 '16 at 14:04
  • I've managed to add and benchmark the linear, tail-recursive `BangPattern` variant [code](https://github.com/lloydmeta/fib-hs/blob/master/src/Fibonacci.hs#L25-L32). `fibLinearBang` *is* definitely faster than the non-bang-pattern version but is still 3.5 times slower than the lazy list implementation (`fibLinear` is ~6 times slower). – lloydmeta Oct 10 '16 at 14:51
  • Your benchmarks are not benchmarking what you think they are - try `fib 100k` or some other large value. For a single invocation, the 'classic lazy' version allocates more memory, takes 5x as long (on my machine, for n=80k) and has nearly 80% GC time (compared to 5% for `fibBang`). Criterion reports that `fibClassic` runs much faster because `fib'` is floated out to the top level and retained for all calls of `fibClassic` (look at the core!). If this result is still surprising, recall that GHC Haskell is heavily optimized for lazy data structures. – user2407038 Oct 10 '16 at 18:27
  • @user2407038 (1) "Your benchmarks are not benchmarking what you think they are" Sounds catchy ;) Would you be more specific? (2) "`fib'` is floated out to the top level and retained for all calls of `fibClassic` (look at the core!)" This is exactly what I was speculating in my question, but the other commenters seem to be in disagreement with us on this. Is there any way you can prove (doesn't need to be disassembly, could even be literature) that this is happening and put that in an answer? – lloydmeta Oct 11 '16 at 00:57
  • @lloydmeta, "This is exactly what I was speculating in my question, but the other commenters seem to be in disagreement with us on this" — it's not a matter of opinion: Core indeed crearly witnesses that `fib'` is floated out. However it's still not very clear to me why `fibClassic` can be faster for some input. My only guess is that storing&processing a lazy list is faster, because you use addition over `Integer`s rather than `Int`s and the former is somewhat expensive. What will happen if you make your functions return `Int`? Is `fibClassic` still faster? – effectfully Oct 11 '16 at 06:56
  • @user3237465: (1) "Core indeed clearly witnesses that `fib'` is floated out": sounds intriguing to me. Can you explain what this means and how one can check and verify it is happening? This is intriguing because it sounds like Haskell will, in some cases, lift/float out a function's "local variable" into a constant variable, and there are repercussions w.r.t. memory and computation! (2) "What will happen if you make your functions return `Int`?" That makes `fibLinearBang` faster at 30ns and `fibClassic` slower at 80ns. – lloydmeta Oct 11 '16 at 07:39
  • @lloydmeta, see [Understanding Core](http://book.realworldhaskell.org/read/profiling-and-optimization.html#id679074) and maybe some blog posts lying around. "Haskell will, in some cases, lift/float a function's "local variable"" — this is what the `full-laziness` optimization does. I don't have a good reference unfortunately. – effectfully Oct 11 '16 at 07:56
  • @user3237465, thanks for the link :), and more importantly for pointing out that `full-laziness` is an actual optimisation term! I did a bit of searching and found out how to turn it off, and the tables have finally turned: Fib 30 for `fibLinearBang` is at 259.8 ns and `fibClassic` gets 738.4 ns. I think that's pretty definitive proof that `fib'` is being kept around and reused :) – lloydmeta Oct 11 '16 at 12:09
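
To "look at the Core" as suggested in the comments above, one can ask GHC to dump the optimised Core it produces, for example (a sketch using standard GHC dump flags; src/Fibonacci.hs is the module from the project):

ghc -O2 -ddump-simpl -dsuppress-all src/Fibonacci.hs

With full laziness on, fib' shows up in that output as a top-level binding, which is what the commenters observed.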

1 Answer


tl;dr: fib' was indeed reused across invocations

Thanks to the commenters, I've learnt that full laziness is not (just) a term used to describe how Haskell evaluates, but an actual compiler optimisation, one that is on by default but can be disabled (using the -fno-full-laziness GHC flag).
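
In practice that means adding the flag alongside the existing optimisation options, e.g. in the cabal file (a sketch; the project's actual cabal file is linked in the comments above):

ghc-options: -O2 -fno-full-laziness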

Re-running the benchmarks with that flag, we see the tables have turned:

  • fibLinearBang (Fib 30): 259.8 ns
  • fibClassic (Fib 30): 738.4 ns

This result, along with this guide on GHC compiler options, is pretty compelling evidence that fib' does get transformed into something like a top-level reference that is reused (and, being a list, memoised) across different invocations of the function.

Both that guide and the paper it links to describing let-lifting admit that let-lifting can result in increased memory residency, which to me is somewhat scary (do not write fib-as-a-service using fibClassic and expose it to the internet...), but I might come around to it.

  • "Pretty compelling evidence" seems a weak condition when you can just *look at the core* and know for sure. The result of compilation depends on compiler version, flags, optimization, guessing as to what happens and which 'optimizations' fire will typically result in long headaches like this. I'm not sure why you believe that `-fno-full-laziness` is a 'solution' - you made the performance of `fibClassic` worse while having no change on that of `fibLinearBang`. Making a program worse than another with compiler flags because you believe it should be so seems completely counterproductive. – user2407038 Oct 11 '16 at 15:13
  • ""Pretty compelling evidence" seems a weak condition when you can just look at the core and know for sure.": The point was to find out *why* `fibClassic` was (imho counterintuitively) beating `fibLinearBang`; the fact that `fno-full-laziness` affected the former and not the latter shows that we have likely found the reason. If you have better evidence for this particular case that `fib'` is being reused or another reason to explain the performance discrepancy, maybe start by providing an answer. I'm interested in what you mean by "look at the core" and how it can be applied here. – lloydmeta Oct 11 '16 at 23:20
  • You look at the core, in which you see the program which is actually produced by your code, and observe that in this program `fib' :: [Integer]` is floated to the top level, making it a CAF (constant applicative form), which means it (i.e. a single copy) will be retained for the duration of the program's execution. 'Full laziness' can't possibly affect `fibLinearBang` because even if you were to manually float out the function to the top level, it wouldn't be computed less often - it's only called once by `fibLinearBang`. – user2407038 Oct 12 '16 at 01:05
  • You keep saying "look at core", but I have yet to see you describe how to do so in this case. Also, you can keep criticising the question, but it has nonetheless puzzled commenters with higher SO and Haskell scores than yourself, showing that the full-laziness optimisation is not intuitive and that not everyone knows when they can rely on it (and in [these](http://stackoverflow.com/questions/35115172/why-is-full-laziness-a-default-optimization) [cases](http://stackoverflow.com/questions/26321784/haskell-compiling-with-o2-drastically-increases-memory-usage) it has caused headaches/head-scratches). – lloydmeta Oct 12 '16 at 01:33
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/125461/discussion-between-lloydmeta-and-user2407038). – lloydmeta Oct 12 '16 at 01:41