Java loop gets slower after some runs / JIT's fault?

Question

So I wanted to benchmark some basic java functionality to add some imformation to this question: What is the gain from declaring a method as static.

I know writing benchmarks is sometimes not easy but what happens here I cannot explain.

Please note that I'm not interessted in how to fix this but on why this happens*

The Test class:

public class TestPerformanceOfStaticVsDynamicCalls {

    private static final long RUNS = 1_000_000_000L;

    public static void main( String [] args ){

        new TestPerformanceOfStaticVsDynamicCalls().run();
    }

    private void run(){

        long r=0;
        long start, end;

        for( int loop = 0; loop<10; loop++ ){

            // Benchmark

            start = System.currentTimeMillis();
            for( long i = 0; i < RUNS; i++ ) {
                r += addStatic( 1, i );
            }
            end = System.currentTimeMillis();
            System.out.println( "Static: " + ( end - start ) + " ms" );

            start = System.currentTimeMillis();
            for( long i = 0; i < RUNS; i++ ) {
                r += addDynamic( 1, i );
            }
            end = System.currentTimeMillis();
            System.out.println( "Dynamic: " + ( end - start ) + " ms" );

            // Do something with r to keep compiler happy
            System.out.println( r );

        }

    }


    private long addDynamic( long a, long b ){

        return a+b;
    }

    private static long addStatic( long a, long b ){

        return a+b;
    }

}

I was expecting the first loop to be warmup and the following loops to be faster.

Running this in Eclipse gave those strange results:

Static: 621 ms
Dynamic: 631 ms
1000000001000000000
Static: 2257 ms
Dynamic: 2501 ms
2000000002000000000
Static: 2258 ms
Dynamic: 2469 ms
3000000003000000000
Static: 2231 ms
Dynamic: 2464 ms
4000000004000000000

So wtf? It got slower. To crosscheck I run the same code with java/c 7:

Static: 620 ms
Dynamic: 627 ms
1000000001000000000
Static: 897 ms
Dynamic: 617 ms
2000000002000000000
Static: 901 ms
Dynamic: 615 ms
3000000003000000000
Static: 888 ms
Dynamic: 616 ms
4000000004000000000

So here only the static calls got slower for the following loops. Even more strange if I rearange the code to only print r after the final loop I get this in Eclipse:

Static: 620 ms
Dynamic: 635 ms
Static: 2285 ms
Dynamic: 893 ms
Static: 2258 ms
Dynamic: 900 ms
Static: 2280 ms
Dynamic: 905 ms
4000000004000000000

And this with java/c 7:

Static: 620 ms
Dynamic: 623 ms
Static: 890 ms
Dynamic: 614 ms
Static: 890 ms
Dynamic: 616 ms
Static: 886 ms
Dynamic: 614 ms
4000000004000000000

while at it changing the order of the dynamic/static benchmarks in eclipse:

Dynamic: 618 ms
Static: 626 ms
1000000001000000000
Dynamic: 632 ms
Static: 2524 ms
2000000002000000000
Dynamic: 617 ms
Static: 2528 ms
3000000003000000000
Dynamic: 622 ms
Static: 2506 ms
4000000004000000000

and in java/c 7:

Dynamic: 625 ms
Static: 646 ms
1000000001000000000
Dynamic: 2470 ms
Static: 633 ms
2000000002000000000
Dynamic: 2459 ms
Static: 635 ms
3000000003000000000
Dynamic: 2464 ms
Static: 645 ms
4000000004000000000

So what is happening here?

EDIT: Some System informations:

Java version "1.7.0_55"
OpenJDK Runtime Environment (IcedTea 2.4.7) (7u55-2.4.7-1ubuntu1)
OpenJDK 64-Bit Server VM (build 24.51-b03, mixed mode)

Intel(R) Core(TM) i7-2720QM CPU @ 2.20GHz

EDIT2:

Using Java8:

Static: 620 ms
Dynamic: 624 ms
1000000001000000000
Static: 890 ms
Dynamic: 618 ms
2000000002000000000
Static: 891 ms
Dynamic: 616 ms
3000000003000000000
Static: 892 ms
Dynamic: 617 ms
4000000004000000000

Other code orderings produce similar strange (but ohter) results here.

http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java might help you — Harry Blargle, Jul 22 '14 at 11:19
On my system : Eclipse Kepler, Java 1.7.0, Windows 7 64bits I do not see any issue with your test : Static: 1870 ms Dynamic: 1901 ms 1000000001000000000 Static: 1871 ms Dynamic: 1901 ms 2000000002000000000 Static: 1861 ms Dynamic: 1901 ms 3000000003000000000 Static: 1871 ms Dynamic: 1891 ms 4000000004000000000 Static: 1871 ms Dynamic: 1881 ms 5000000005000000000 Static: 1871 ms Dynamic: 1911 ms 6000000006000000000 Static: 1860 ms Dynamic: 1881 ms 7000000007000000000 Static: 1871 ms Dynamic: 1891 ms 8000000008000000000 — Eric Nicolas, Jul 22 '14 at 11:32
I can't reproduce your results with JDK 8 on a Win 7 machine. I'm always getting same times for `dynamic` and `static` calls, the `static` ones being about 3-5% faster. What makes a difference is the assignment to an instance or a static variable. Make `r` static, and you'll get faster execution. About another 5% on my machine for both types of calls. — MicSim, Jul 22 '14 at 11:32
@Eric/MicSim: Thanks for trying to reproduce. Could you provide more detailed output of `java -v`? Strangely enough I get similar results even if using java8 ... — Scheintod, Jul 22 '14 at 11:38
java version "1.8.0_05" // Java(TM) SE Runtime Environment (build 1.8.0_05-b13) // Java HotSpot(TM) Client VM (build 25.5-b02, mixed mode, sharing) — MicSim, Jul 22 '14 at 11:51
@MicSim: Do you mind rerunning it with the `-server` option? — Scheintod, Jul 22 '14 at 11:53
I suspect that the JITC is stuck halfway through. Try invoking your `run` method twice. — Hot Licks, Jul 22 '14 at 12:12
java version "1.7.0_09" Java(TM) SE Runtime Environment (build 1.7.0_09-b05) Java HotSpot(TM) Client VM (build 23.5-b02, mixed mode) — Eric Nicolas, Jul 22 '14 at 12:16
@Eric: Thanks. Very similar to mine. Narrows it down to being a Linux problem (or just MY problem) — Scheintod, Jul 22 '14 at 12:35
I still suspect that it's a JITC thing. The JITC has gotten the code "instrumented" to collect statistics, but has not yet switched in the fully JITCed code. — Hot Licks, Jul 22 '14 at 12:35
I've just run the code with `-server` option and it makes a difference. Now times are different, by about 50% off each other. If `r` is an instance variable then `dynamic` takes 50% longer than `static`, and vice versa if you declare `r` as static. That would somehow match your code rearrangement results. There's almost no difference when running without the `-server` option. — MicSim, Jul 22 '14 at 12:59
Note that one of the things that the server option usually does is change the tuning values for the JITC, making it take longer to actually recompile the code, but (presumably) creating more efficient code in the process. The server option assumes that the app will run for hours, and a few sub-optimal minutes at the start is no big deal. — Hot Licks, Jul 22 '14 at 15:29
@Hot Licks: Jup. But (at least) the Linux/amd64 jre7 does not provide a client mode at all. So I've currently no means of testing what would happen there. — Scheintod, Jul 22 '14 at 15:41

score 11 · Accepted Answer · answered Jul 22 '14 at 14:01

Preamble: Writing microbenchmarks manually is almost always doomed to a failure.
There are frameworks that have already solved the common benchmarking problems.

JIT compilation unit is a method. Incorporating several benchmarks into a single method leads to unpredictable results.
JIT heavily relies on the execution profile, i.e. the run-time statistics. If a method runs the first scenario for a long time, JIT will optimize the generated code for it. When the method suddenly switches to another scenario, do not expect it to run at the same speed.
JIT may skip optimizing the code that is not executed. It will leave an uncommon trap for this code. If the trap is ever hit, JVM will deoptimize the compiled method, switch to interpreter and after that recompile the code with the new knowledge. E.g. when your method run is compiled for the first time inside the first hot loop, JIT does not know about System.out.println yet. As soon as the execution reaches println, the earlier compiled code is likely to get deoptimized.
The bigger is method - the harder is to optimize it for JIT compiler. E.g. it may appear that there is not enough spare registers to hold all local variables. That's what happen in your case.

To sum it up, your benchmark seem to pass through the following scenario:

The first hot loop (addStatic) triggers the compilation of run method. The execution profile does not know anything except addStatic method.
System.out.println triggers the deoptimization and after that the second hot loop (addDynamic) leads to recompilation of run method.
Now the execution profile contains information only about addDynamic, so JIT optimizes the second loop, and the first loop appears to have extra register spills:

Optimized loop:

0x0000000002d01054: add    %rbx,%r14
0x0000000002d01057: add    $0x1,%rbx          ;*ladd
                                              ; - TestPerformanceOfStaticVsDynamicCalls::addDynamic@2
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@105

0x0000000002d0105b: add    $0x1,%r14          ; OopMap{rbp=Oop off=127}
                                              ;*goto
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@116

0x0000000002d0105f: test   %eax,-0x1c91065(%rip)        # 0x0000000001070000
                                              ;*lload
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@92
                                              ;   {poll}
0x0000000002d01065: cmp    $0x3b9aca00,%rbx
0x0000000002d0106c: jl     0x0000000002d01054

Loop with an extra register spill:

0x0000000002d011d0: mov    0x28(%rsp),%r11  <---- the problem is here
0x0000000002d011d5: add    %r10,%r11
0x0000000002d011d8: add    $0x1,%r10
0x0000000002d011dc: add    $0x1,%r11
0x0000000002d011e0: mov    %r11,0x28(%rsp)    ;*ladd
                                              ; - TestPerformanceOfStaticVsDynamicCalls::addStatic@2
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@33

0x0000000002d011e5: mov    0x28(%rsp),%r11  <---- the problem is here
0x0000000002d011ea: add    $0x1,%r11          ; OopMap{[32]=Oop off=526}
                                              ;*goto
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@44

0x0000000002d011ee: test   %eax,-0x1c911f4(%rip)        # 0x0000000001070000
                                              ;*goto
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@44
                                              ;   {poll}
0x0000000002d011f4: cmp    $0x3b9aca00,%r10
0x0000000002d011fb: jl     0x0000000002d011d0  ;*ifge
                                              ; - TestPerformanceOfStaticVsDynamicCalls::run@25

P.S. The following JVM options are useful to analyze the JIT compilation:

-XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining -XX:+PrintAssembly -XX:CompileOnly=TestPerformanceOfStaticVsDynamicCalls

Hi. Thanks for your answer. I have two issues understanding it, though: 1. My assembler skils are a little rusty (and limited to 68k anyways). But do i guess correctly that `0x28(%rsp),%r11` and it's counterpart move register to the stack and back again? — Scheintod, Jul 22 '14 at 14:34
And secondly: I still don't understand why the code **gets slower** for the second loop. There has to have been an optimized working version of it already? Do you suggest that the jit throws away the fast version for the crapy one? — Scheintod, Jul 22 '14 at 14:35
@Scheintod Right, `0x28(%rsp),%r11` loads register from stack, and `%r11,0x28(%rsp)` stores R11 to stack. — apangin, Jul 22 '14 at 15:03
@Scheintod "Do you suggest that the jit throws away the fast version for the crapy one?" -- Yes. When the second benchmark starts, JIT recompiles the whole `run` method optimizing it for the second benchmark. The first version of compiled code (optimized for the first benchmark) is thrown away. — apangin, Jul 22 '14 at 15:08
That's not what I meant. ... I think ;) What I was trying to ask: for `loop=0` i get results like: 623ms / 645ms. So there was fast code there for both versions. Only for `loop>=1` I get the slow results. — Scheintod, Jul 22 '14 at 15:20
Thinking about it ... So the JIT first optimizes the first loop, then throws it away, then optimized the second loop, then throws this away, too ... and then gives up? — Scheintod, Jul 22 '14 at 15:21
@Scheintod I meant there was no version optimized for both benchmarks. `loop=0` was executed by two different compilations. The first one has been abandoned right after the first benchmark, because this compilation had not been prepared to handle `println` and the following inner loop. The second compilation was made to execute the whole outer loop, but optimized only for the second benchmark (because the compilation took place during execution of the second benchmark). — apangin, Jul 22 '14 at 18:44
Run java with `-XX:+PrintCompilation` to watch what methods are being compiled and when. — apangin, Jul 22 '14 at 18:45

score 1 · Answer 2 · answered Jul 22 '14 at 14:03

It looks like is the way Java is adding values to variable r. I've made a few changes, adding method run2():

public class TestPerformanceOfStaticVsDynamicCalls {
    private static final long RUNS = 1_000_000_000L;
    public static void main(String[] args) {
        System.out.println("Test run 1 =================================");
        new TestPerformanceOfStaticVsDynamicCalls().run();
        System.out.println("Test run 2 =================================");
        new TestPerformanceOfStaticVsDynamicCalls().run2();
    }
    private void run2() {
        long r = 0;
        long start, end;
        for (int loop = 0; loop < 10; loop++) {
            // Benchmark
            long stat = 0;
            start = System.currentTimeMillis();
            for (long i = 0; i < RUNS; i++) {
                stat += addStatic(1, i);
            }
            end = System.currentTimeMillis();
            System.out.println("Static: " + (end - start) + " ms");
            long dyna = 0;
            start = System.currentTimeMillis();
            for (long i = 0; i < RUNS; i++) {
                dyna += addDynamic(1, i);
            }
            end = System.currentTimeMillis();
            System.out.println("Dynamic: " + (end - start) + " ms");
            // If you really want to have values in "r" then...
            r += stat + dyna;
            // Do something with r to keep compiler happy
            System.out.println(r);
        }
    }
    private void run() {
        long r = 0;
        long start, end;
        for (int loop = 0; loop < 10; loop++) {
            // Benchmark
            start = System.currentTimeMillis();
            for (long i = 0; i < RUNS; i++) {
                r += addStatic(1, i);
            }
            end = System.currentTimeMillis();
            System.out.println("Static: " + (end - start) + " ms");
            start = System.currentTimeMillis();
            for (long i = 0; i < RUNS; i++) {
                r += addDynamic(1, i);
            }
            end = System.currentTimeMillis();
            System.out.println("Dynamic: " + (end - start) + " ms");
            // If you really want to have values in "r" then...
            // Do something with r to keep compiler happy
            System.out.println(r);
        }
    }
    private long addDynamic(long a, long b) {
        return a + b;
    }
    private static long addStatic(long a, long b) {
        return a + b;
    }
}

The results for are:

Test run 1 =================================
Static: 582 ms
Dynamic: 579 ms
1000000001000000000
Static: 2065 ms
Dynamic: 2352 ms
2000000002000000000
Static: 2084 ms
Dynamic: 2345 ms
3000000003000000000
Static: 2095 ms
Dynamic: 2347 ms
4000000004000000000
Static: 2102 ms
Dynamic: 2338 ms
5000000005000000000
Static: 2073 ms
Dynamic: 2345 ms
6000000006000000000
Static: 2074 ms
Dynamic: 2341 ms
7000000007000000000
Static: 2102 ms
Dynamic: 2355 ms
8000000008000000000
Static: 2062 ms
Dynamic: 2354 ms
9000000009000000000
Static: 2057 ms
Dynamic: 2350 ms
-8446744063709551616
Test run 2 =================================
Static: 584 ms
Dynamic: 582 ms
1000000001000000000
Static: 587 ms
Dynamic: 577 ms
2000000002000000000
Static: 577 ms
Dynamic: 579 ms
3000000003000000000
Static: 577 ms
Dynamic: 577 ms
4000000004000000000
Static: 578 ms
Dynamic: 579 ms
5000000005000000000
Static: 578 ms
Dynamic: 580 ms
6000000006000000000
Static: 577 ms
Dynamic: 579 ms
7000000007000000000
Static: 578 ms
Dynamic: 577 ms
8000000008000000000
Static: 580 ms
Dynamic: 578 ms
9000000009000000000
Static: 576 ms
Dynamic: 579 ms
-8446744063709551616

As for why adding directly to r, I have no clue. Maybe somebody can provide more insights on why accessing r inside the loop block makes things much slower.

score 0 · Answer 3 · answered Jul 22 '14 at 15:45

Just one additional Note. I can only observe this strange behavior if I use long for r and is. If I convert them to int then I get these timings:

Static: 352 ms
Dynamic: 353 ms
Static: 348 ms
Dynamic: 349 ms
Static: 349 ms
Dynamic: 348 ms
Static: 349 ms
Dynamic: 344 ms

So one possible conclusion is to avoid long in those situations. At least with Linux/Amd64 Java7 where perfomance matters.

Java loop gets slower after some runs / JIT's fault?

3 Answers3

Linked