Why do these two pieces of Julia code perform so differently?

Question

function c1()
        x::UInt64 = 0
        while x<= (10^8 * 10)
                x+=1
        end
end

function c2()
        x::UInt64 = 0
        while x<= (10^9)
                x+=1
        end
end

function c3()
        x::UInt64 = 0
        y::UInt64 = 10^8 * 10
        while x<= y
                x+=1
        end
end

function c4()
        x::UInt64 = 0
        y::UInt64 = 10^9
        while x<= y
                x+=1
        end
end

Should be the same, right?

@time c1()

0.019102 seconds (40.99 k allocations: 2.313 MiB)

@time c1()

0.000003 seconds (4 allocations: 160 bytes)

@time c2()

9.205925 seconds (47.89 k allocations: 2.750 MiB)

@time c2()

9.015212 seconds (4 allocations: 160 bytes)

@time c3()

0.019848 seconds (39.23 k allocations: 2.205 MiB)

@time c3()

0.000003 seconds (4 allocations: 160 bytes)

@time c4()

0.705712 seconds (47.41 k allocations: 2.719 MiB)

@time c4()

0.760354 seconds (4 allocations: 160 bytes)

Not familiar with Julia but how much time will it take to run the code with some `y::Uint64 = (10^8*10)` and substitution of the expression in `while` with the variable? `while x<= y` — Lukas G, Oct 31 '18 at 23:40
Odd. The first problem is that your functions currently *always* return `nothing` and for some reason the compiler is able to work this out for `c1` and `c3` and basically just skips the entire routine. However, once you add `return x` to the bottom of your functions, there is still a difference in timings that I'm not sure I can account for. Still thinking about it. It is almost certainly some sort of compiler optimization being triggered, I just can't see what it is. — Colin T Bowers, Oct 31 '18 at 23:49
Also, `using Pkg ; Pkg.add("BenchmarkTools") ; using BenchmarkTools`. Now use `@btime` to time your routines instead of `@time`. Much better solution and you won't have to keep running things twice to get rid of compile-time complications. — Colin T Bowers, Oct 31 '18 at 23:50
I bet it has something to do with compiler logic that if x <= 10^8 then it *must* also be true that x <= 10^8*10 and so the compiler can skip those checks. I'll write it up. — Colin T Bowers, Oct 31 '18 at 23:53
`@btime` shows 0 allocations whereas `@time` shows many more (depending on the function). Which is correct? — , Nov 01 '18 at 00:31
It is about `8` in `10^8` being a power of two, and `power_by_square`, and the following optimizations at compile-time. Please see my answer. — hckr, Nov 01 '18 at 07:20

hckr · Accepted Answer · 2018-11-01T09:12:29.350

This is about Julia's compile-time optimization of literals using power-by-squaring. Julia is able to optimize if the exponent can be reached via solely power-by-squaring or the power is 0,1,2,3. This is I believe done through lowering x^p to x^Val{p} for integer p and using compiler-specialization (or inlining plus a kind of metaprogramming, I am not sure what is the correct term here but it's like something you will find in Lisp; similar techniques is used for source-to-source auto-differentiation in Julia, see Zygote.jl) techniques to lower the code to a constant if p is 0,1,2,3 or a power of 2.

Julia lowers 10^8 to inlined literal_pow (and then power_by_squaring) and this gets lowered down to a constant then julia lowers constant * 10 to get another constant and then realizes all the while loop is unnecessary and removes the loop and so on, all at compile-time.

If you change 10^8 with 10^7 in c1 you will see that it will evaluate the number and the loop at runtime. However, if you replace 10^8 with 10^4 or 10^2 you will see that it will handle all the computation at compile time. I think julia is not set specifically to compile-time optimize if the exponent is power of 2 but instead the compiler turns out to be able to optimize (lowering the code to a constant) the code just for that case.

The case in which p is 1,2,3 is hard-coded in Julia. This is optimized again through lowering the code to inlined version of literal_pow and then compile-specialization.

You can use @code_llvm and @code_native macros to see what's going on. Let's try.

julia> f() = 10^8*10
julia> g() = 10^7*10

julia> @code_native f()
.text
; Function f {
; Location: In[101]:2
    movl    $1000000000, %eax       # imm = 0x3B9ACA00
    retq
    nopw    %cs:(%rax,%rax)
;}

julia> @code_native g()
.text
; Function g {
; Location: In[104]:1
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: In[104]:1
    pushq   %rax
    movabsq $power_by_squaring, %rax
    movl    $10, %edi
    movl    $7, %esi
    callq   *%rax
;}}}
; Function *; {
; Location: int.jl:54
    addq    %rax, %rax
    leaq    (%rax,%rax,4), %rax
;}
    popq    %rcx
    retq
;}

See f() turns out to be just a constant, while g() will evaluate stuff at run-time.

I think julia started this integer exponentiation trick around this commit if you would like to dig more.

EDIT: Let's compile-time optimize c2

I have also prepared a function to compute integer-integer exponents, with which julia will also optimize for non-power-of-2 exponents. I am not sure it is correct in all cases, though.

@inline function ipow(base::Int, exp::Int)
    result = 1;
    flag = true;
    while flag
        if (exp & 1  > 0)
            result *= base;
        end
        exp >>= 1;
        base *= base;
        flag = exp != 0
    end

    return result;
end

Now replace your 10^9 in c2 with ipow(10,9), and enjoy the power of compile-time optimization.

Also see this question for power-by-squaring.

Please don't use this function as is, since it tries to inline all the exponentiation, whether or not it consists of literals. You wouldn't want that.

It would have taken me a very, very long time to work this out. Thanks for a great answer. I learned a lot. — Colin T Bowers, Nov 01 '18 at 10:59
Though, I am not sure why the compiler cannot specialize for the non-power-of-2 powers. A little more investigation may help. — hckr, Nov 01 '18 at 11:19

Colin T Bowers · Answer 2 · 2018-11-01T22:21:22.400

2nd UPDATE: Check out hckr answer. Much better than mine.

UPDATE: This is not a comprehensive answer. Just as much as I've been able to puzzle out, and I've had to give up for now due to time constraints.

I'm probably not the best person to answer this question, since as far as compiler optimization is concerned, I know just enough to be dangerous. Hopefully someone who understands Julia's compiler a little better stumbles across this question and can give a more comprehensive response, because from what I can see, your c2 function is doing an awful lot of work that it shouldn't need to.

So, there are at least two issues at play here. Firstly, as it stands, both c1 and c2 will always return nothing. For some reason I don't understand, the compiler is able to work this out in the case of c1, but not in the case of c2. Consequently, after compilation, c1 runs almost instantly because the loop in the algorithm is never actually performed. Indeed:

julia> @btime c1()
  1.535 ns (0 allocations: 0 bytes)

You can also see this using @code_native c1() and @code_native c2(). The former is only a couple of lines long while the latter contains many more instructions. Also worth noting that the former does not contain any references to the function <= indicating that the condition in the while loop has been completely optimized out.

We can deal with this first issue by adding a return x statement at the bottom of both your functions, which forces the compiler to actually engage with the question of what the final value of x will be.

However, if you do this, you'll note that c1 is still about 10 times faster than c2, which is the second puzzling issue about your example.

It seems to me that even with a return x, a sufficiently clever compiler has all the information it should need to skip the loop entirely. That is, it knows at compile time the start value of x, the exact value of the transformation inside the loop, and the exact value of the terminating condition. Surprisingly enough, if you run @code_native c1() (after adding return x at the bottom), you'll notice that it has indeed worked out the function return value right there in the native code (cmpq $1000000001):

julia> @code_native c1()
    .text
; Function c1 {
; Location: REPL[2]:2
    movq    $-1, %rax
    nopw    (%rax,%rax)
; Location: REPL[2]:3
; Function <=; {
; Location: int.jl:436
; Function <=; {
; Location: int.jl:429
L16:
    addq    $1, %rax
    cmpq    $1000000001, %rax       # imm = 0x3B9ACA01
;}}
    jb  L16
; Location: REPL[2]:6
    retq
    nopl    (%rax)
;}

so I'm not really sure why it is still doing any work at all!

For reference, here is the output of @code_native c2() (after adding return x):

julia> @code_native c2()
    .text
; Function c2 {
; Location: REPL[3]:2
    pushq   %r14
    pushq   %rbx
    pushq   %rax
    movq    $-1, %rbx
    movabsq $power_by_squaring, %r14
    nopw    %cs:(%rax,%rax)
; Location: REPL[3]:3
; Function literal_pow; {
; Location: none
; Function macro expansion; {
; Location: none
; Function ^; {
; Location: intfuncs.jl:220
L32:
    addq    $1, %rbx
    movl    $10, %edi
    movl    $9, %esi
    callq   *%r14
;}}}
; Function <=; {
; Location: int.jl:436
; Function >=; {
; Location: operators.jl:333
; Function <=; {
; Location: int.jl:428
    testq   %rax, %rax
;}}}
    js  L59
    cmpq    %rax, %rbx
    jbe L32
; Location: REPL[3]:6
L59:
    movq    %rbx, %rax
    addq    $8, %rsp
    popq    %rbx
    popq    %r14
    retq
    nopw    %cs:(%rax,%rax)
;}

There is clearly a lot of additional work going on here for c2 that doesn't make much sense to me. Hopefully someone more familiar with Julia's internals can shed some light on this.

@oneloop I was running out the door when posting that answer so it was all a bit rushed. In hindsight, I don't like my previous idea. I've had a look at `@code_native` and typed up a new answer that certainly demonstrates the issue with `nothing` but bottom-line it completely fails to puzzle out why `c2` is slower than `c1` even after adding a `return x` statement. Hopefully someone who knows more about Julia's internals comes along and answers what I think is a very interesting question. — Colin T Bowers, Nov 01 '18 at 05:43

Why do these two pieces of Julia code perform so differently?

2 Answers2