Experimenting with tail call optimization (tco), I stumbled upon the following curious example:
unsigned long long int fac1(unsigned long long int n){
    if (n==0)
        return 1;
    return n*fac1(n-1);
}
Actually, I was impressed that gcc was able to perform tco here (with the -O2 flag), because it is not that straightforward:
fac1(unsigned long long):
testq %rdi, %rdi
movl $1, %eax
je .L4
.L3:
imulq %rdi, %rax
subq $1, %rdi
jne .L3
rep ret
.L4:
rep ret
However, after changing the return type from unsigned long long int to unsigned int, gcc was no longer able to perform tco:
unsigned int fac2(unsigned long long int n){
    if (n==0)
        return 1;
    return n*fac2(n-1);
}
we can clearly see the recursive call in the resulting assembly:
fac2(unsigned long long):
testq %rdi, %rdi
jne .L16
movl $1, %eax
ret
.L16:
pushq %rbx
movq %rdi, %rbx
leaq -1(%rdi), %rdi
call fac2(unsigned long long)
imull %ebx, %eax
popq %rbx
ret
At first, I dismissed this as a missed optimization, but now I'm not so sure, because clang can't perform this optimization either. So maybe there are subtleties of the language I'm not aware of which prevent it.
Why doesn't gcc perform the tail-call optimization for the function fac2, but only for fac1?
It is clear to me that in the second version the result must be truncated to 32 bits, and obviously this is the only difference. But why should this be a problem that prevents tco?
For example, if I help the compiler and rewrite my function as a classic tail recursion (which should produce results identical to version fac2):
unsigned int tlo_fac(unsigned long long int n, unsigned long long int cur){
    if (n==0)
        return cur;
    return tlo_fac(n-1, n*cur);
}
unsigned int fac(unsigned long long int n){
    return tlo_fac(n,1);
}
I get a tco-optimized version which is identical to fac1 (the high 32 bits are allowed to contain garbage, so imulq can be used after inlining):
fac(unsigned long long):
testq %rdi, %rdi
movl $1, %eax
je .L10
.L11:
imulq %rdi, %rax
subq $1, %rdi
jne .L11
.L10:
rep ret