2nd UPDATE: Check out hckr's answer. It's much better than mine.
UPDATE: This is not a comprehensive answer, just as much as I've been able to puzzle out; I've had to give up for now due to time constraints.
I'm probably not the best person to answer this question, since as far as compiler optimization is concerned, I know just enough to be dangerous. Hopefully someone who understands Julia's compiler a little better stumbles across this question and can give a more comprehensive response, because from what I can see, your c2 function is doing an awful lot of work that it shouldn't need to.
So, there are at least two issues at play here. Firstly, as it stands, both c1 and c2 will always return nothing. For some reason I don't understand, the compiler is able to work this out in the case of c1, but not in the case of c2. Consequently, after compilation, c1 runs almost instantly because the loop in the algorithm is never actually performed. Indeed:
julia> using BenchmarkTools   # provides @btime

julia> @btime c1()
1.535 ns (0 allocations: 0 bytes)
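As a quick sanity check on the nothing claim (just an illustration, not something from the original question): a while loop in Julia evaluates to nothing, so a function whose body ends with one returns nothing unless told otherwise.

julia> c1() === nothing
true

julia> c2() === nothing   # this one actually executes the loop, so it takes a while
true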
You can also see the difference using @code_native c1() and @code_native c2(). The former is only a couple of lines long, while the latter contains many more instructions. It's also worth noting that the former contains no references to the function <=, indicating that the condition in the while loop has been completely optimized out.
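If you want to poke at this yourself, note that the reflection macros live in InteractiveUtils, which the REPL loads for you but a script does not. A sketch of the commands (output omitted here):

using InteractiveUtils   # provides @code_native, @code_llvm, @code_warntype

@code_native c1()   # short listing: the loop and the <= test are gone
@code_native c2()   # much longer listing: the loop body survives
@code_llvm c1()     # the same comparison at the LLVM IR level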
We can deal with this first issue by adding a return x statement at the bottom of both your functions, which forces the compiler to actually engage with the question of what the final value of x will be.
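For concreteness, here's a sketch of what I mean: my reconstruction of your two functions with the fix applied. The exact loop bounds are assumptions on my part, inferred from the native code further down (c1 compares against the folded literal 1000000001, and c2 calls power_by_squaring with arguments 10 and 9):

function c1()
    x = 0
    while x <= 1_000_000_000    # literal constant
        x += 1
    end
    return x                    # the added statement
end

function c2()
    x = 0
    while x <= 10^9             # power computed via literal_pow
        x += 1
    end
    return x                    # the added statement
end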
However, if you do this, you'll note that c1 is still about 10 times faster than c2, which is the second puzzling issue about your example.
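To reproduce the comparison (a sketch; actual numbers will vary by machine and Julia version):

using BenchmarkTools

@btime c1()   # near-instant even with the return x added
@btime c2()   # expect roughly 10x slower, per the discussion above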
It seems to me that even with a return x, a sufficiently clever compiler has all the information it should need to skip the loop entirely: it knows at compile time the start value of x, the exact transformation applied inside the loop, and the exact terminating condition. Surprisingly enough, if you run @code_native c1() (after adding return x at the bottom), you'll notice that it has indeed worked out the function's return value right there in the native code (the cmpq $1000000001):
julia> @code_native c1()
.text
; Function c1 {
; Location: REPL[2]:2
movq $-1, %rax
nopw (%rax,%rax)
; Location: REPL[2]:3
; Function <=; {
; Location: int.jl:436
; Function <=; {
; Location: int.jl:429
L16:
addq $1, %rax
cmpq $1000000001, %rax # imm = 0x3B9ACA01
;}}
jb L16
; Location: REPL[2]:6
retq
nopl (%rax)
;}
So I'm not really sure why it is still doing any work at all!
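To spell out my reading of that listing, here is a hand-translation of the loop back into Julia (hypothetical, written by me; it's not anything Julia generates). The compiler has folded the comparison constant, but it still counts up one step at a time:

function c1_compiled()               # hypothetical equivalent of the native code above
    x = -1                           # movq $-1, %rax
    while true
        x += 1                       # addq $1, %rax
        x < 1_000_000_001 || break   # cmpq $1000000001, %rax; jb L16
    end
    return x                         # retq
end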
For reference, here is the output of @code_native c2() (after adding return x):
julia> @code_native c2()
.text
; Function c2 {
; Location: REPL[3]:2
pushq %r14
pushq %rbx
pushq %rax
movq $-1, %rbx
movabsq $power_by_squaring, %r14
nopw %cs:(%rax,%rax)
; Location: REPL[3]:3
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: intfuncs.jl:220
L32:
addq $1, %rbx
movl $10, %edi
movl $9, %esi
callq *%r14
;}}}
; Function <=; {
; Location: int.jl:436
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
testq %rax, %rax
;}}}
js L59
cmpq %rax, %rbx
jbe L32
; Location: REPL[3]:6
L59:
movq %rbx, %rax
addq $8, %rsp
popq %rbx
popq %r14
retq
nopw %cs:(%rax,%rax)
;}
There is clearly a lot of additional work going on here for c2. As far as I can tell from the listing, the loop is calling power_by_squaring(10, 9) to recompute 10^9 on every single iteration (the movl $10, %edi / movl $9, %esi / callq *%r14 sequence inside the loop), rather than folding it into a constant, and there is an extra sign check on the result (testq %rax, %rax) each time around. Why none of this gets hoisted out of the loop doesn't make much sense to me. Hopefully someone more familiar with Julia's internals can shed some light on this.
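As an aside, and purely as a workaround rather than an explanation: if my reading above is right, manually hoisting the power out of the loop condition should make c2 behave like c1, since 10^9 then only gets computed once. A sketch (the function name is mine):

function c2_hoisted()
    x = 0
    limit = 10^9        # computed once, before the loop
    while x <= limit
        x += 1
    end
    return x
end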