
Brief Intro:

(Minor edit: My intention in asking this question is not about the algorithms themselves. I'm fully aware of the fast iterative solution using 3 local variables or an array of 3 elements. In fact, I deliberately designed both tests to have the smallest complexity difference I could come up with. What I'd like to know is whether implementing the same algorithm as the recursive one, attacking the same problem with a custom stack and iteration, can improve performance!)

When we learned programming at school, we were usually told that iteration is generally more efficient than recursion, unless recursion gives a particularly specific and elegant way to solve the problem.

So recently I decided to put this to a simple test. Since function calls are essentially handled by the call stack, it's possible to implement a custom stack to hold the required local variables and turn a recursive implementation into an iterative one. Below is an example where I implement a Fibonacci number calculator based on recursion, and a supposedly equivalent algorithm using iteration, both written in C.



Method:

Recursive impl. (fibon_recu):

#include <stdint.h>

uint64_t calls = 0;

/* Calculate Fibonacci number using recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
  calls++;
  return (idx <= 1) ? idx : (fibonacci(idx - 1) + fibonacci(idx - 2));
}

Iterative impl. (fibon_iter):

#include <stdint.h>

#define ARR_STACK_SIZE 256 /* assumed value for this excerpt; large enough for any uint8_t index */

uint64_t loop_count;

/* Calculate Fibonacci number using stack-based method derived from recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
  uint64_t ret = 0;
  uint8_t stack_val[ARR_STACK_SIZE], cache;
  uint16_t stack_size;

  loop_count = 0;

  // Push index into simulated stack
  stack_size = 1;
  *stack_val = idx;

  while(stack_size)
  {
    // Pop simulated stack top
    stack_size -= 1;
    cache = *(stack_val + stack_size);

    if(cache > 1)
    {
      // Push <index - 1> and <index - 2> into simulated stack
      *(stack_val + stack_size) = cache - 1;
      *(stack_val + stack_size + 1) = cache - 2;
      stack_size += 2;
    }
    else
    {
      ret += cache;
    }

    loop_count++;
  }

  return ret;
}

It's worth mentioning that the answer by templatetypedef says:

Depending on the implementation of the stack, that stack might end up needing to dynamically allocate memory in the heap, which is typically more expensive than setting up and tearing down stack frames. As always, if you have two algorithms and want to compare their empirical performance, it's probably best to just run the two against one another and see which ones win.

Since this is indeed what I observed on my testing device, I decided to use a static array to simulate the stack, as the array itself is allocated in the stack frame instead of on the heap. Accessing heap-allocated variables in this case made the performance roughly 20-30 times worse on my device (figure not shown here).
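
For reference, the heap-backed variant behaves like the sketch below (this is an illustration of the idea, not the exact code I benchmarked; it simply replaces the local array with a malloc'd buffer):

#include <stdint.h>
#include <stdlib.h>

/* Sketch: same algorithm as fibon_iter, but the simulated stack lives on the heap. */
uint64_t fibonacci_heap(uint8_t idx)
{
  uint64_t ret = 0;
  uint8_t *stack_val = malloc(256), cache; /* 256 entries is plenty for any uint8_t index */
  uint16_t stack_size;

  if (stack_val == NULL)
    return 0;

  // Push index onto the simulated (heap-allocated) stack
  stack_size = 1;
  *stack_val = idx;

  while (stack_size)
  {
    // Pop simulated stack top
    stack_size -= 1;
    cache = *(stack_val + stack_size);

    if (cache > 1)
    {
      // Push <index - 1> and <index - 2> onto the simulated stack
      *(stack_val + stack_size) = cache - 1;
      *(stack_val + stack_size + 1) = cache - 2;
      stack_size += 2;
    }
    else
    {
      ret += cache;
    }
  }

  free(stack_val);
  return ret;
}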

Nonetheless, this answer by Nathaniel Ford suggests that an iterative implementation with a custom stack can sometimes improve performance, which I find quite interesting.

Also, to make the comparison as fair as I could, I compiled both pieces of code with the -O0 flag to disable compiler optimizations.



Results:

[Result figure: execution times for fibon_recu and fibon_iter, annotated with the gcc version and basic device info]

Here fibon_recu stands for the recursive tests and fibon_iter for the iterative tests. Just for reference, in case this information is relevant, I put both the gcc version and basic device info on the figure.



Update:

<Sep. 7, 2021; 11:00 UTC>
Thanks to Ian Abbott for pointing out that using pointer access instead of indexing can improve performance when the -O0 flag is present. This brings the execution time of the iterative tests down to only slightly longer than that of the recursive tests.
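
Roughly, the change amounts to keeping a running pointer into the array instead of recomputing stack_val + stack_size on every access. A sketch in the spirit of that suggestion (the code I actually measured may differ in small details):

#include <stdint.h>

#define ARR_STACK_SIZE 256 /* assumed value; large enough for any uint8_t index */

uint64_t loop_count;

/* Calculate Fibonacci number using the same simulated stack, accessed through a pointer. */
uint64_t fibonacci(uint8_t idx)
{
  uint64_t ret = 0;
  uint8_t stack_val[ARR_STACK_SIZE];
  uint8_t *sp = stack_val; /* running stack pointer instead of an index */

  loop_count = 0;

  // Push index onto the simulated stack
  *sp++ = idx;

  while (sp != stack_val)
  {
    // Pop simulated stack top
    uint8_t cache = *--sp;

    if (cache > 1)
    {
      // Push <index - 1> and <index - 2> onto the simulated stack
      *sp++ = cache - 1;
      *sp++ = cache - 2;
    }
    else
    {
      ret += cache;
    }

    loop_count++;
  }

  return ret;
}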

<Sep. 7, 2021; 22:45 UTC>
Thanks to all the generous devs here for the insights and even detailed tests! I made some updates after reading those answers and also marked the question as solved. However, please read all the answers below, as each of them offers a different but important perspective!

First, Eugene pointed out that the function-call mechanism itself is effectively already optimized even with the -O0 flag, while the home-brew stack implementation is not. Thus -O0 is actually still not a fair test after all, as Lundin suggests.

Second, Jacon pointed out that tail recursion may not be replaced by the compiler up to -O1, while -O1 gives a significant performance improvement for the stack-based iterative test conducted by John.

The assembly output and the results from the answers mentioned above are listed below:

  1. -O1: [assembly listing and result figure]

  2. -O2: [assembly listing and result figure]

It turns out the compiler behavior mentioned by Eugene does hold in my test environment: the -O1 flag preserves all the recursive bl calls, while the -O2 flag optimizes away the tail bl call. The improvement observed by John can also be reproduced on my device, as shown in the results above: with -O1, iteration performs much better than recursion.

Thus, I'd guess the test results show that the cost of the extra instructions and the cost of the recursive calls are competing with each other. The -O0 flag gives no optimization for the home-brew stack at all, and the resulting extra work cancels out the advantages of iteration. The -O1 flag keeps the recursive calls while optimizing the iterative implementation, granting the latter a performance boost. The -O2 flag removes the tail recursion and makes the recursive implementation run better than it does under -O1, bringing the two competing factors back into contention.
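
For illustration, eliminating that tail call by hand gives something along the lines of the sketch below (this is what the optimization effectively does, not the compiler's actual output):

#include <stdint.h>

/* Sketch: the second (tail) call fibonacci(idx - 2) becomes a jump back to the top
   of the loop with an accumulator; the first call stays recursive. */
uint64_t fibonacci_tail_eliminated(uint8_t idx)
{
  uint64_t acc = 0;

  for (;;)
  {
    if (idx <= 1)
      return acc + idx;

    acc += fibonacci_tail_eliminated(idx - 1); /* non-tail call remains */
    idx -= 2;                                  /* former tail call becomes iteration */
  }
}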



Original Question:

Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something actually hidden in plain sight?

Thanks!

  • Since you have disabled optimization, it might help to use a pointer to access the simulated stack instead of an index. – Ian Abbott Sep 07 '21 at 10:39
  • @IanAbbott Thank you very much for the hint! Your solution brings down the run time for iterative case much closer to the recursive one! I'll update my question soon. – Zevin Zenph Zambori Sep 07 '21 at 10:58
  • You can make it even faster with a purely iterative approach without a stack at all. – Jabberwocky Sep 07 '21 at 11:08
  • @Jabberwocky I hope OP is aware of that and is just using an O(2^n) implementation of fibonacci for experimental purposes! :-) – Ian Abbott Sep 07 '21 at 11:12
  • I think it's worth considering that this example is based on pretty much the *opposite* of "recursion gives a specific and elegant way to solve the problem." Recursive fibonacci is a bad, pathologically bad, use of recursion. Your "iterative" version isn't really iterative, as it preserves the explosive N² stack growth, and your handcrafted stack is evidently mildly less efficient than your system's own call stack. – Steve Summit Sep 07 '21 at 11:16
  • To @Jabberwocky: Yes, Ian Abbott is correct. I made this just to test if iterative implementation can have any impact on performance even if the algorithm is essentially equivalent. I feel like the Fibonacci number is the simplest example I can come up with and the iterative implementation is deliberately designed to mimic the recursive workflow. c: – Zevin Zenph Zambori Sep 07 '21 at 11:18
  • Function calls are much more efficient than they used to be, so the "penalty" for using a recursive algorithm is probably less than it once was. – Steve Summit Sep 07 '21 at 11:18
  • @SteveSummit thanks for the info! May I ask if I can say that "recursion generally performs worse than iteration" is now becoming a myth, given that function calls are much less expensive, at least in C? (Setting aside stack size limitations and issues like that.) – Zevin Zenph Zambori Sep 07 '21 at 11:22
  • If you must use recursion, at least use an algorithm that truly takes advantage of it, and does not need every Fib(n) calculating and *more than once* too, such as Dijkstra's method outlined [here](http://www.maths.surrey.ac.uk/hosted-sites/R.Knott/Fibonacci/fibFormula.html#exact). – Weather Vane Sep 07 '21 at 11:23
  • @ZevinZenphZambori Do your programs always have the same execution time? Maybe running them multiple times to take into account the variance of execution time would give an interesting insight... – LNiederha Sep 07 '21 at 11:25
  • I would be surprised if the double recursion were recognised as tail recursion. If not, it would explode into an Ackermann-like explosion. – wildplasser Sep 07 '21 at 11:27
  • ...in the example I linked, only 22 F values needed to be computed for F(1000). – Weather Vane Sep 07 '21 at 11:29
  • @WeatherVane My question is actually not about the algorithms themselves. I'm actually wondering if the performance can be improved by simply using custom stacks instead of recursive function call stacks. I'll update my question to make my point a bit more clear. c: – Zevin Zenph Zambori Sep 07 '21 at 11:30
  • @LNiederha I'll do that soon. Thanks! – Zevin Zenph Zambori Sep 07 '21 at 11:30
  • @ZevinZenphZambori I believe the old advice is probably still true (I don't believe it's a myth), but I suspect it will be hard to demonstrate convincingly, in part because *everything* is so fast these days. You will probably have to design a careful experiment, using lots of repetitions, to demonstrate the effect convincingly, without having the results thrown off by irrelevancies. (And it may be hard to know what's real and what's irrelevant.) – Steve Summit Sep 07 '21 at 11:31
  • _using stack-based method derived from recursive algorithm_ Well then it is no wonder it is the same. You don't need to perform a dance (with stacks +- function calls) to get the next value - you only have to hold two at a time (but not recursively). The more classic factorial example should be more illustrative and less confusing. Don't know how you can overcomplicate that one. –  Sep 07 '21 at 11:32
  • Please see [When is optimisation premature?](https://stackoverflow.com/questions/385506/when-is-optimisation-premature) The evil here is doggedly pursuing the very inefficient algorithm. – Weather Vane Sep 07 '21 at 11:33
  • "as -O0 flag is present" If that flag is present, then this question is nonsense. – Lundin Sep 07 '21 at 11:38
  • @Jacon I actually tried that too! Unfortunately the value blows up too quickly before any difference can be identified, even when I used double. Still, thanks for the hints! – Zevin Zenph Zambori Sep 07 '21 at 11:41
  • @WeatherVane Thank you very much for the info! I'll look into it. I really didn't expect to run into such problem as an amateur coder haha. – Zevin Zenph Zambori Sep 07 '21 at 11:44
  • As an (ex-professional) amateur coder, implementing Dijkstra's beautiful approach does pay off in recreational coding. – Weather Vane Sep 07 '21 at 11:47
  • @WeatherVane Cool! I'll definitely look into that! – Zevin Zenph Zambori Sep 07 '21 at 11:54
  • @WeatherVane Doesn't Dijkstra's method only save any calculations (versus a simple linear algorithm) if you cache all the intermediate results to avoid recalculating them? EDIT: Scrub that, I was thinking of doing it top-down rather than bottom-up. Bottom-up should be fast. – Ian Abbott Sep 07 '21 at 12:13
  • Just as a stylistic note, please use array index notation instead of pointer arithmetic (i.e., `*(stack_val + stack_size)` => `stack_val[stack_size]`). It’s no less performant, it makes your code easier to read, and you’re less likely to make mistakes. – John Bode Sep 07 '21 at 12:33
  • @JohnBode Interestingly, it produces different assembly at -O0 (upd: and -O1 too). Not very different, but still, they are not equal. – Eugene Ryabtsev Sep 07 '21 at 12:40
  • @IanAbbott the Dijkstra method is particularly useful for problems where `n` might be very large (and the result is constrained by a modulus). In a sense, it is a 'binary' solution. – Weather Vane Sep 07 '21 at 15:19
  • @WeatherVane Yes, but it needs some sort of lookup mechanism for storing and retrieving O(log N) intermediate results. There seem to be better methods for calculating Fibonacci numbers than Dijkstra's method though, based on 2x2 matrices of bignums. E.g. `fib8(n)` in `arXiv:1803.07199v2 [cs.DS]`. – Ian Abbott Sep 07 '21 at 15:24
  • @IanAbbott no: have a look at the algorithm carefully. The intermediate terms are returned by the recursion. In practice I use a lookup only for the first few F terms, and my implementation is very slightly reorganised. – Weather Vane Sep 07 '21 at 15:26
  • My use of "better than Dijkstra" was improper. I meant better from a space complexity point of view. – Ian Abbott Sep 07 '21 at 15:32
  • @IanAbbott there may be better methods, but Dijkstra is still vastly better than computing every lesser term, and OP's solution duplicates them too. – Weather Vane Sep 07 '21 at 15:33
  • @WeatherVane My naive, top down implementation of Dijkstra that didn't save the intermediate values ended up being O(N) due to the repeated calculations. And I don't know how to do it bottom up (starting from fib(0) and fib(1) and building up to fib(n)). – Ian Abbott Sep 07 '21 at 15:37
  • @IanAbbott approximately describing it, you begin with `fib(n)` and recurse for `fib(n/2)` and `fib(n/2 - 1)`. In my implementation the exact details depend on whether `n` is odd or even. – Weather Vane Sep 07 '21 at 15:42
  • @WeatherVane That's more or less what I was doing, but calling `fib(n/2-1)` and `fib(n/2)` for even `n`, and `fib(n/2+1)` and `fib(n/2)` for odd `n`. (There are some minor differences in the combination of the two values compared to your method, but the result is the same.) The trouble was (for example) there were 15 calls for fib(10), 163 calls for fib(100), 1511 calls for fib(1000), etc. (Seems to be on the order of phi*n calls in total, if I was to guess.) – Ian Abbott Sep 07 '21 at 16:26
  • @IanAbbott the worked example in the link shows that only 22 intermediate values were needed for `fib(1000)`. This is approximately `log2(1000) * 2`. – Weather Vane Sep 07 '21 at 16:37
  • @WeatherVane Yes, but each intermediate value may be needed more than once, hence the requirement to cache them instead of recalculating them. – Ian Abbott Sep 07 '21 at 16:42
  • @IanAbbott that is as it may be. In my implementation of the idea, the recursive function is named `twoterms()` and computes the two values needed, taking advantage of any duplication. In some cases it also computes a third value from the other two. When called by `fib(1000)` (which is non-recursive) and with only the first three terms looked-up, `twoterms()` is called only 9 times. No caching was needed. In practice I pre-compute 94 terms (max 64-bit can hold) in an array by the linear iterative method. – Weather Vane Sep 07 '21 at 17:03

4 Answers

3

What I'd like to know about is that if implementing an identical algorithm to the recursive one attacking the same problem in custom stacks and iteration can improve the performance!

This assumes the algorithms are actually identical. Your stack-based algorithm isn't identical to the recursive version - for one thing, you're performing a lot more assignments and tests in the stack-based version than in the recursive version, which kills any gains you make by not calling the function recursively. For another, the recursive version is (most likely) passing arguments via registers, not the stack.

On my system, turning on level 1 optimization significantly improves the runtime performance of the stack-based version relative to the recursive version (to where the stack-based version takes much less time than the recursive version). On my system, going from -O0 to -O1 speeds up the recursive version by ~26% and the stack-based version by ~67%.

And for what it's worth, this is a naive iterative implementation of fibonacci:

unsigned long fib_it( int n )
{
  /**
   * Circular buffer to store the
   * intermediate values
   */
  unsigned long f[3] = { 0ul, 1ul, 1ul };

  if ( n <= 2 )
    return f[n];

  int i;
  for ( i = 3; i <= n; i++ )
    f[i % 3] = f[(i-1) % 3] + f[(i-2) % 3];

  return f[(i+2) % 3];
}

which absolutely spanks the recursive and stack-based versions in terms of both run time and memory usage as n gets large (you basically just have to worry about overflow). As others have shown, there are faster algorithms that minimize the number of intermediate calculations, and there is a closed-form version that's faster yet.
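
For what it's worth, a closed-form (Binet) version might look like the sketch below; note that it's only exact while the result fits in a double's 53-bit mantissa (roughly n <= 70), so it's more of a curiosity here:

#include <math.h>

/**
 * Closed-form (Binet) sketch: round( phi^n / sqrt(5) ).
 * Exact only for roughly n <= 70 due to double precision.
 */
unsigned long fib_closed( int n )
{
  const double sqrt5 = sqrt( 5.0 );
  const double phi   = (1.0 + sqrt5) / 2.0;

  return (unsigned long)( pow( phi, n ) / sqrt5 + 0.5 );
}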

If a recursive algorithm is executing too slowly for you, you don't speed it up by re-implementing its recursion with your own stack - you speed it up by using a non-recursive approach. As a definition of the Fibonacci function, F(n) = F(n-1) + F(n-2) is compact and easy to understand, but as an implementation it is horribly inefficient because it computes the same values multiple times. The iterative version above computes each value exactly once.

What makes recursive functions like quicksort fast is that they dramatically reduce the size of the problem space with each call. Recursive fibonacci doesn't do that.

Edit

Some numbers. My test harness takes two arguments - the N'th number to compute, and the algorithm to use - s for your stack-based version, r for recursive, and i for my iterative version above. For small N the differences aren't too dramatic (compiled with -O1):

$ ./fib -n 5 -r
   i          F(i)    clocks   seconds
   -          ----    ------   -------
   5             5         1  0.000001

$ ./fib -n 5 -s
   i          F(i)    clocks   seconds
   -          ----    ------   -------
   5             5         2  0.000002

$ ./fib -n 5 -i
   i          F(i)    clocks   seconds
   -          ----    ------   -------
   5             5         2  0.000002

As N gets larger, though, differences show up:

$ ./fib -n 40 -r
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  40     102334155    543262  0.543262

$ ./fib -n 40 -s
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  40     102334155    330824  0.330824

$ ./fib -n 40 -i
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  40     102334155         2  0.000002

Once you get past N=50 the run times for the recursive and stack-based versions get obnoxiously long, whereas the runtime for the iterative version doesn't change:

 $ ./fib -n 50 -r
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  50   12586269025  66602759 66.602759

 $ ./fib -n 50 -s
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  50   12586269025  39850376 39.850376

 $ ./fib -n 50 -i
   i          F(i)    clocks   seconds
   -          ----    ------   -------
  50   12586269025         2  0.000002

Now, this is how things behave on my system, and this is a 0th level analysis using very crude instrumentation (a couple of calls to clock).
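
For reference, the instrumentation is essentially just the pattern sketched below (the real harness additionally parses the -n and algorithm options):

#include <stdio.h>
#include <time.h>

extern unsigned long fib_it( int n ); /* or the recursive / stack-based version */

int main( void )
{
  int n = 40;

  clock_t start = clock();
  unsigned long result = fib_it( n );
  clock_t stop  = clock();

  printf( "%4d %13lu %9ld %9.6f\n", n, result,
          (long)(stop - start), (double)(stop - start) / CLOCKS_PER_SEC );

  return 0;
}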

John Bode
  • circular buffer with modulo? Closed form? Why not `new = a + b; a = b, b = new;` ? Or `f[2] = f[0] + f[1]; f[0] = f[1], f[1] = f[2]` ? –  Sep 08 '21 at 06:19
  • @Jacon: Mainly for giggles. Trading mod operations for assignments. For the values of N we're dealing with, performance is a wash between the two versions. By the time N gets large enough for any significant differences to show up (mod version is ever so slightly slower), we've long since overflowed. – John Bode Sep 08 '21 at 16:12
2

It's just more assembly and more stack access.

As opposed to register-based parameter passing. You can easily see the differences between the first and second algorithms. More assembly code does not necessarily translate to slower speed, but in this case, it does.

Contemporary mainstream ISAs and ABIs have some built-in functionality to implement function calls, which is active even at -O0. There is no such built-in support for your homebrew stack structure.

Eugene Ryabtsev
1

Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something actually hidden in plain sight?

Because benchmarking code with optimization disabled is utter nonsense. gcc -O0 disables all optimization (but supposedly improves compilation speed). You have effectively disabled loop unrolling, tail call optimization and everything else that is extremely relevant when comparing the efficiency of these two snippets.

Also, to make the comparison as fair as I could, I compiled both pieces of code with the -O0 flag to disable compiler optimizations.

No, you made it as unfair as you could achieve.

You essentially took two sports cars, removed the engines, then manually pushed them down two different slopes to see which car is fastest...

Lundin
  • May I ask why? And yes, I'm an amateur user. Since I don't really know how compiler optimization is performed, I feel like `-O0` would make less of a difference in the parts I have no control over. Please tell me if I'm wrong. I'll post `-O2` tests later when I have more time. – Zevin Zenph Zambori Sep 07 '21 at 11:52
  • @ZevinZenphZambori Because with optimization disabled the compiler just barfs out completely inefficient machine code with more or less random performance. Discussing performance of recursion without tail call optimization in mind is pointless, for example, since it will always be dreadfully slow and also slaughter the stack memory unless the compiler manages to unroll it. Which compilers do so-so even to this day, so avoiding recursion in 99% of all use-cases is a good idea. – Lundin Sep 07 '21 at 11:56
  • To get back to the sports car analogy: you should leave the engine in, tune it up to max and put the pedal to the metal - only then will you actually see which car that's the fastest. – Lundin Sep 07 '21 at 11:59
  • Thanks for the interesting point! Following the analogy, I guess what I'm looking for is e.g. the relation between gearbox design/material and the max speed of a car. I was hoping that `-O0` would minimize the difference between the 2 binaries, since from what I've heard, optimization can reduce simple recursion into iteration. Turns out the compilation process is much more complex than I expected, and it's quite difficult to just modify the gearbox itself without messing with the rest of the parts. – Zevin Zenph Zambori Sep 07 '21 at 20:01
1

I compared two factorial functions with -O3 and -fno-inline, because with inlining both versions amount to the same thing! Even so, the recursive one does not call itself, but it somehow keeps an extra instruction:

recursive: sub and cmp

11e0:       |  /-> 89 fa                    mov    edx,edi
11e2:       |  |   83 ef 01                 sub    edi,0x1
11e5:       |  |   0f af c2                 imul   eax,edx
11e8:       |  |   83 ff 01                 cmp    edi,0x1
11eb:       |  \-- 75 f3                    jne    11e0 <frec+0x10>

non-rec: tighter inner loop:

11c0:       |  /-> 44 0f af c0              imul   r8d,eax
11c4:       |  |   83 e8 01                 sub    eax,0x1
11c7:       |  \-- 75 f7                    jne    11c0 <fnonrec+0x10>

program:

#include <stdio.h>
#include <stdlib.h>

int fnonrec(int n) {
    int p = 1;
    while (n-- > 1)
        p *= n;
    return p;

}
int frec(int n) {
    if (n == 1)
        return 1;
    return (n * frec(n-1));
}
int main(void) {
    for (int i = 0; i < 10000; i++)
        //printf("%d\n", frec(rand()%1000 + 1));
        printf("%d\n", fnonrec(rand()%1000 + 1));
}

Both take 8 ms when the output is redirected to /dev/null; the recursive one executes 50% more instructions in the above (sloppy, overflowing) form.


Up to -O1 frec() calls itself recursively; without a "keep-recursion" compilation flag a fair comparison is difficult. Only inlining can be controlled, which is linked to recursion:

  • a recursive function cannot be inlined and keep the recursion. It would only suppress one of n calls.
  • a recursive function can be called without recursion even with -fno-inline, even though the compiler inlines it into itself; removing recursion is not inlining.