Tail recursion vs non tail recursion. Is the former slower?

Question

I am learning the basics of functional programming and Erlang, and I've implemented three versions of the factorial function: using recursion with guards, using recursion with pattern matching, and using tail recursion.

I am trying to compare the performance of each factorial implementation (Erlang/OTP 22 [erts-10.4.1]):

%% Simple factorial code:
fac(N) when N == 0 -> 1;
fac(N) when N > 0 -> N * fac(N - 1).

%% Using pattern matching:
fac_pattern_matching(0) -> 1;
fac_pattern_matching(N) when N > 0 -> N * fac_pattern_matching(N - 1).

%% Using tail recursion (and pattern matching):
tail_fac(N) -> tail_fac(N, 1).

tail_fac(0, Acc) -> Acc;
tail_fac(N, Acc) when N > 0 -> tail_fac(N - 1, N * Acc).

Timer helper:

-define(PRECISION, microsecond).

execution_time(M, F, A, D) ->
  StartTime = erlang:system_time(?PRECISION),
  Result = apply(M, F, A),
  EndTime = erlang:system_time(?PRECISION),
  io:format("Execution took ~p ~ps~n", [EndTime - StartTime, ?PRECISION]),
  if
    D =:= true -> io:format("Result is ~p~n", [Result]);
    true -> ok
  end
.

Execution results:

Recursive version:

3> mytimer:execution_time(factorial, fac, [1000000], false).
Execution took 1253949667 microseconds
ok

Recursive with pattern matching version:

4> mytimer:execution_time(factorial, fac_pattern_matching, [1000000], false).
Execution took 1288239853 microseconds
ok

Tail recursive version:

5> mytimer:execution_time(factorial, tail_fac, [1000000], false).
Execution took 1405612434 microseconds
ok

I was expecting tail recursion version to perform better than the other two but, to my surprise it is less performant. These results are the exact opposite of what I was expecting.

Why?

Writing benchmarks is hard. Are you sure you are measuring what you think you are measuring? Did you account for statistical effects? Did you account for dynamic adaptive optimizations? Did you account for the environment? Here's a couple of examples of what non-obvious things you need to account for in benchmarks: https://groups.google.com/forum/#!msg/mechanical-sympathy/icNZJejUHfE/BfDekfBEs_sJ, https://stackoverflow.com/a/513259/2988. These mainly talk about HotSpot, but most of those problems apply to any modern high-performance execution engine with dynamic adaptive optimizations. — Jörg W Mittag, Jun 16 '19 at 07:57

Hynek -Pichi- Vychodil · Accepted Answer · 2019-06-21T08:34:26.840

The problem is in function which you choose. Factorial is a function which grows very fast. Erlang has implemented big integer arithmetics, so it will not overflow. You are effectively measuring how good is underlying big integer implementation. 1000000! is a huge number. It is 8.26×10^5565708 which is like 5.6MB long written as a decadic number. There is a difference between your fac/1 and tail_fac/1 how fast they reach big numbers where big integer implementation kicks in and how fast the number grows. In you fac/1 implementation you are effectively computing 1*2*3*4*...*N. In your tail_fac/1 implementation you are computing N*(N-1)*(N-2)*(N-3)*...*1. Do you see the issue there? You can write tail call implementation in a different way:

tail_fac2(N) when is_integer(N), N > 0 ->
    tail_fac2(N, 0, 1).

tail_fac2(X, X, Acc) -> Acc;
tail_fac2(N, X, Acc) ->
    Y = X + 1,
    tail_fac2(N, Y, Y*Acc).

It will work much better. I'm not patient as you are so I will measure a little bit smaller numbers but the new fact:tail_fac2/1 shoudl outperform fact:fac/1 every single time:

1> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7743768
2> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7629604
3> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7651739
4> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7229662
5> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7104056
6> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6491195
7> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6506565
8> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6519624

As you can see fact:tail_fac2/1 for N = 100000 takes 6.5s, fact:tail_fac/1 takes 7.2s and fact:fac/1 takes 7.6s. Even faster growth doesn't overturn tail call benefit so tail call version is faster than body recursive one there is clearly seen that slower growth of accumulator in fact:tail_fac2/1 show its impact.

If you choose a different function for tail call optimization testing you can see the impact of tail call optimization more clearly. For example sum:

sum(0) -> 0;
sum(N) when N > 0 -> N + sum(N-1).

tail_sum(N) when is_integer(N), N >= 0 ->
    tail_sum(N, 0).

tail_sum(0, Acc) -> Acc;
tail_sum(N, Acc) -> tail_sum(N-1, N+Acc).

And speed is:

1> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
970749
2> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
126288
3> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
113115
4> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
104371
5> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
125857
6> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92282
7> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92634
8> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
68047
9> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
87748
10> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
94233

As you can see, there we can easily use N=10000000 and it works pretty fast. Anyway, body recursive function is significantly slower 110ms vs 85ms. You can notice the first run of fact:sum/1 took 9x longer than the rest of runs. It is because of body recursive function consuming a stack. You will not see such effect when you use a tail recursive counterpart. (Try it.) You can see the difference if you run each measurement in a separate process.

1> F = fun(G, N) -> spawn(fun() -> {T, _} = timer:tc(fun()-> fact:G(N) end), io:format("~p took ~bus and ~p heap~n", [G, T, element(2, erlang:process_info(self(), heap_size))]) end) end.
#Fun<erl_eval.13.91303403>
2> F(tail_sum, 10000000).
<0.88.0>
tail_sum took 70065us and 987 heap
3> F(tail_sum, 10000000).
<0.90.0>
tail_sum took 65346us and 987 heap
4> F(tail_sum, 10000000).
<0.92.0>
tail_sum took 65628us and 987 heap
5> F(tail_sum, 10000000).
<0.94.0>
tail_sum took 69384us and 987 heap
6> F(tail_sum, 10000000).
<0.96.0>
tail_sum took 68606us and 987 heap
7> F(sum, 10000000).
<0.98.0>
sum took 954783us and 22177879 heap
8> F(sum, 10000000).
<0.100.0>
sum took 931335us and 22177879 heap
9> F(sum, 10000000).
<0.102.0>
sum took 934536us and 22177879 heap
10> F(sum, 10000000).
<0.104.0>
sum took 945380us and 22177879 heap
11> F(sum, 10000000).
<0.106.0>
sum took 921855us and 22177879 heap

Tail recursion vs non tail recursion. Is the former slower?

1 Answers1