The problem is in function which you choose. Factorial is a function which grows very fast. Erlang has implemented big integer arithmetics, so it will not overflow. You are effectively measuring how good is underlying big integer implementation. 1000000! is a huge number. It is 8.26×10^5565708 which is like 5.6MB long written as a decadic number. There is a difference between your fac/1
and tail_fac/1
how fast they reach big numbers where big integer implementation kicks in and how fast the number grows. In you fac/1
implementation you are effectively computing 1*2*3*4*...*N
. In your tail_fac/1
implementation you are computing N*(N-1)*(N-2)*(N-3)*...*1
. Do you see the issue there? You can write tail call implementation in a different way:
tail_fac2(N) when is_integer(N), N > 0 ->
tail_fac2(N, 0, 1).
tail_fac2(X, X, Acc) -> Acc;
tail_fac2(N, X, Acc) ->
Y = X + 1,
tail_fac2(N, Y, Y*Acc).
It will work much better. I'm not patient as you are so I will measure a little bit smaller numbers but the new fact:tail_fac2/1
shoudl outperform fact:fac/1
every single time:
1> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7743768
2> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7629604
3> element(1, timer:tc(fun()-> fact:fac(100000) end)).
7651739
4> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7229662
5> element(1, timer:tc(fun()-> fact:tail_fac(100000) end)).
7104056
6> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6491195
7> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6506565
8> element(1, timer:tc(fun()-> fact:tail_fac2(100000) end)).
6519624
As you can see fact:tail_fac2/1
for N = 100000
takes 6.5s, fact:tail_fac/1
takes 7.2s and fact:fac/1
takes 7.6s. Even faster growth doesn't overturn tail call benefit so tail call version is faster than body recursive one there is clearly seen that slower growth of accumulator in fact:tail_fac2/1
show its impact.
If you choose a different function for tail call optimization testing you can see the impact of tail call optimization more clearly. For example sum:
sum(0) -> 0;
sum(N) when N > 0 -> N + sum(N-1).
tail_sum(N) when is_integer(N), N >= 0 ->
tail_sum(N, 0).
tail_sum(0, Acc) -> Acc;
tail_sum(N, Acc) -> tail_sum(N-1, N+Acc).
And speed is:
1> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
970749
2> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
126288
3> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
113115
4> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
104371
5> element(1, timer:tc(fun()-> fact:sum(10000000) end)).
125857
6> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92282
7> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
92634
8> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
68047
9> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
87748
10> element(1, timer:tc(fun()-> fact:tail_sum(10000000) end)).
94233
As you can see, there we can easily use N=10000000
and it works pretty fast. Anyway, body recursive function is significantly slower 110ms vs 85ms. You can notice the first run of fact:sum/1
took 9x longer than the rest of runs. It is because of body recursive function consuming a stack. You will not see such effect when you use a tail recursive counterpart. (Try it.) You can see the difference if you run each measurement in a separate process.
1> F = fun(G, N) -> spawn(fun() -> {T, _} = timer:tc(fun()-> fact:G(N) end), io:format("~p took ~bus and ~p heap~n", [G, T, element(2, erlang:process_info(self(), heap_size))]) end) end.
#Fun<erl_eval.13.91303403>
2> F(tail_sum, 10000000).
<0.88.0>
tail_sum took 70065us and 987 heap
3> F(tail_sum, 10000000).
<0.90.0>
tail_sum took 65346us and 987 heap
4> F(tail_sum, 10000000).
<0.92.0>
tail_sum took 65628us and 987 heap
5> F(tail_sum, 10000000).
<0.94.0>
tail_sum took 69384us and 987 heap
6> F(tail_sum, 10000000).
<0.96.0>
tail_sum took 68606us and 987 heap
7> F(sum, 10000000).
<0.98.0>
sum took 954783us and 22177879 heap
8> F(sum, 10000000).
<0.100.0>
sum took 931335us and 22177879 heap
9> F(sum, 10000000).
<0.102.0>
sum took 934536us and 22177879 heap
10> F(sum, 10000000).
<0.104.0>
sum took 945380us and 22177879 heap
11> F(sum, 10000000).
<0.106.0>
sum took 921855us and 22177879 heap