Brief Intro:
(Minor edit: my intention in asking this question is not about the algorithms themselves. I'm fully aware of the fast iterative solution using 3 local variables or an array of 3 elements. In fact, I deliberately designed both tests to have the smallest complexity difference I could come up with. What I'd like to know is whether implementing the same algorithm as the recursive one, attacking the same problem with a custom stack and iteration, can improve performance!)
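For reference only, the fast three-variable solution mentioned above looks roughly like this (a sketch; `fibonacci_fast` is an illustrative name, not part of my tests):

```c
#include <stdint.h>

/* O(n) iterative Fibonacci with three rolling variables --
 * shown only for reference; it is NOT what the tests below compare. */
uint64_t fibonacci_fast(uint8_t idx)
{
    uint64_t prev = 0, curr = 1, next;
    if (idx == 0)
        return 0;
    for (uint8_t i = 1; i < idx; i++) {
        next = prev + curr; /* F(i+1) = F(i) + F(i-1) */
        prev = curr;
        curr = next;
    }
    return curr;
}
```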
When we learned about programming at school, we were usually told that iteration is generally more efficient than recursion, unless recursion offers a specific and elegant way to solve the problem.
So recently I decided to put this to a simple test. Since function calls are essentially handled by the call stack, it should be possible to maintain a custom stack for the required local variables and turn a recursive implementation into an iterative one. Below is an example of a Fibonacci number calculator based on recursion, and a supposedly equivalent algorithm using iteration, both written in C.
Method:
Recursive impl. (fibon_recu):
#include <stdint.h>

uint64_t calls = 0;

/* Calculate Fibonacci number using the recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    calls++;
    return (idx <= 1) ? idx : (fibonacci(idx - 1) + fibonacci(idx - 2));
}
Iterative impl. (fibon_iter):
#include <stdint.h>

#define ARR_STACK_SIZE 1024 /* large enough for the tested indices */

uint64_t loop_count;

/* Calculate Fibonacci number using a stack-based method derived from the recursive algorithm. */
uint64_t fibonacci(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t stack_val[ARR_STACK_SIZE], cache;
    uint16_t stack_size;

    loop_count = 0;

    // Push index onto simulated stack
    stack_size = 1;
    *stack_val = idx;

    while (stack_size)
    {
        // Pop simulated stack top
        stack_size -= 1;
        cache = *(stack_val + stack_size);
        if (cache > 1)
        {
            // Push <index - 1> and <index - 2> onto simulated stack
            *(stack_val + stack_size) = cache - 1;
            *(stack_val + stack_size + 1) = cache - 2;
            stack_size += 2;
        }
        else
        {
            ret += cache;
        }
        loop_count++;
    }
    return ret;
}
It's worth mentioning that the answer by templatetypedef says:
Depending on the implementation of the stack, that stack might end up needing to dynamically allocate memory in the heap, which is typically more expensive than setting up and tearing down stack frames. As always, if you have two algorithms and want to compare their empirical performance, it's probably best to just run the two against one another and see which ones win.
Since this is indeed what I observed on my testing device, I decided to use a static array to simulate the stack, as the array itself is allocated in the stack frame instead of on the heap. Accessing heap-allocated variables in this case makes performance about 20-30 times worse on my device (figures not shown here).
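For context, the heap-backed variant I compared against looked roughly like this (a sketch; the function name and buffer size here are illustrative, not my exact test code):

```c
#include <stdint.h>
#include <stdlib.h>

/* Same stack-based algorithm as above, but the simulated stack
 * lives in malloc'd heap memory instead of a local array. */
uint64_t fibonacci_heap_stack(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t *stack_val = malloc(4096); /* heap allocation instead of a stack-frame array */
    uint16_t stack_size = 1;

    if (!stack_val)
        return 0;
    *stack_val = idx;
    while (stack_size) {
        uint8_t cache = stack_val[--stack_size]; /* pop */
        if (cache > 1) {
            stack_val[stack_size]     = cache - 1; /* push idx - 1 */
            stack_val[stack_size + 1] = cache - 2; /* push idx - 2 */
            stack_size += 2;
        } else {
            ret += cache; /* base case contributes 0 or 1 */
        }
    }
    free(stack_val);
    return ret;
}
```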
Nonetheless, this answer by Nathaniel Ford suggests that an iterative implementation with a custom stack can sometimes improve performance, which I find quite interesting.
Also, to make the comparison as fair as I could, I compiled both pieces of code with the -O0 flag to disable compiler optimizations.
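The timing itself was done with a minimal harness along these lines (a sketch, not my exact script; `time_fibonacci` is an illustrative name, and clock() has coarse resolution):

```c
#include <stdint.h>
#include <time.h>

/* Recursive implementation under test (same as above). */
static uint64_t fibonacci(uint8_t idx)
{
    return (idx <= 1) ? idx : (fibonacci(idx - 1) + fibonacci(idx - 2));
}

/* Run one call, store its result, and return elapsed CPU time
 * in seconds as measured by clock(). */
static double time_fibonacci(uint8_t idx, uint64_t *result)
{
    clock_t start = clock();
    *result = fibonacci(idx);
    return (double)(clock() - start) / CLOCKS_PER_SEC;
}
```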
Results:
Here fibon_recu stands for the recursive tests and fibon_iter for the iterative tests. For reference, in case this information is relevant, I put both the gcc version and basic device info on the figure.
Update:
<Sep. 7, 2021; 11:00 UTC>
Thanks to Ian Abbott for pointing out that using pointer access instead of indexing can improve performance when the -O0 flag is present. This brings the execution time for the iterative tests down to only slightly longer than the recursive tests.
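The pointer-access version of the loop looks roughly like this (a sketch of the change; `fibonacci_ptr` and the stack size are illustrative):

```c
#include <stdint.h>

#define ARR_STACK_SIZE 1024 /* assumed size, as in the original */

/* Same simulated-stack algorithm, but a moving stack pointer
 * replaces the size counter and index arithmetic, which -O0
 * would otherwise leave unoptimized. */
uint64_t fibonacci_ptr(uint8_t idx)
{
    uint64_t ret = 0;
    uint8_t stack_val[ARR_STACK_SIZE];
    uint8_t *sp = stack_val; /* points one past the stack top */

    *sp++ = idx; /* push initial index */
    while (sp != stack_val) {
        uint8_t cache = *--sp; /* pop */
        if (cache > 1) {
            *sp++ = cache - 1; /* push idx - 1 */
            *sp++ = cache - 2; /* push idx - 2 */
        } else {
            ret += cache;
        }
    }
    return ret;
}
```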
<Sep. 7, 2021; 22:45 UTC>
Thanks to all the generous devs here for the insights and even detailed tests! I made some updates after reading those answers and also marked the question as solved. However, please read all the answers below, as each of them offers a different but important perspective!
First, Eugene pointed out that the function calls themselves can still be optimized even with the -O0 flag, while the home-brew stack implementation is not. Thus -O0 is actually not a fair test after all, as Lundin suggests.
Second, Jacon pointed out that tail recursion may not be eliminated by the compiler up to -O1, while -O1 gives a significant performance improvement for the stack-based iterative test conducted by John.
The assembly output and the test results from the answers mentioned above are listed below:
-O1: fibon_iter_O1.asm, fibon_recu_O1.asm
- Test result:
-O2: fibon_iter_O2.asm, fibon_recu_O2.asm
- Test result:
It turns out the compiler behavior mentioned by Eugene does hold in my test environment: the -O1 flag preserves all the recursive bl calls, while the -O2 flag optimizes away the tail bl call. The improvement observed by John can also be reproduced on my device, as shown in the results above: the -O1 flag makes iteration perform much better than recursion.
Thus, I'd guess the test results show that instruction complexity and recursion-call overhead compete with each other in terms of efficiency and performance. The -O0 flag gives no optimization for the home-brew stack at all, and the resulting extra complexity cancels out the advantages of iteration. The -O1 flag retains the recursive calls while optimizing the iterative implementation, granting the latter a performance boost. The -O2 flag eliminates the tail recursion and makes the recursive implementation run better than under -O1, making the competing factors apparent again.
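To illustrate that last point: eliminating the tail call in the recursive version is roughly equivalent to rewriting it like this (a sketch of the transformation, not the compiler's actual output):

```c
#include <stdint.h>

/* Sketch of what tail-call elimination effectively does: the
 * final call fibonacci(idx - 2) becomes a loop iteration, so
 * only fibonacci(idx - 1) remains a real recursive call. */
uint64_t fibonacci_tco(uint8_t idx)
{
    uint64_t acc = 0;
    while (idx > 1) {
        acc += fibonacci_tco(idx - 1); /* the remaining real call */
        idx -= 2;                      /* the eliminated tail call */
    }
    return acc + idx; /* idx is now 0 or 1, the base case */
}
```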
Original Question:
Why does the seemingly equivalent iterative implementation still perform no better than the recursive one? Did I get anything obviously wrong, or is something hidden in plain sight?
Thanks!