12

I have this piece of code doing Range Minimum Query. When t = 100000 and i and j change on every input line, its execution time on Java 8u60 is about 12 secs.

for (int a0 = 0; a0 < t; a0++) {
    String line = reader.readLine();
    String[] ls = line.split(" ");
    int i = Integer.parseInt(ls[0]);
    int j = Integer.parseInt(ls[1]);
    int min = width[i];
    for (int k = i + 1; k <= j; k++) {
        if (min > width[k]) {
            min = width[k];
        }
    }
    writer.write(min + "");
    writer.newLine();
}

When I extract the minimum search into a separate method, it runs about 4 times faster (about 2.5 secs).

for (int a0 = 0; a0 < t; a0++) {
    String line = reader.readLine();
    String[] ls = line.split(" ");
    int i = Integer.parseInt(ls[0]);
    int j = Integer.parseInt(ls[1]);
    int min = getMin(i, j);
    writer.write(min + "");
    writer.newLine();
}

private int getMin(int i, int j) {
    int min = width[i];
    for (int k = i + 1; k <= j; k++) {
        if (min > width[k]) {
            min = width[k];
        }
    }
    return min;
}

I always thought that method calls were slow, but this example shows the opposite. Java 6 shows the same behaviour, though the execution times are much slower in both cases (17 secs and 10 secs). Can someone provide some insight into this?

Tu Le
  • 309
  • 2
  • 7

5 Answers

16

TL;DR The JIT compiler has more opportunities to optimize the inner loop in the second case, because on-stack replacement happens at a different point.

I've managed to reproduce the problem with a reduced test case.
No I/O or string operations are involved, just two nested loops with array access.

public class NestedLoop {
    private static final int ARRAY_SIZE = 5000;
    private static final int ITERATIONS = 1000000;

    private int[] width = new java.util.Random(0).ints(ARRAY_SIZE).toArray();

    public long inline() {
        long sum = 0;

        for (int i = 0; i < ITERATIONS; i++) {
            int min = width[0];
            for (int k = 1; k < ARRAY_SIZE; k++) {
                if (min > width[k]) {
                    min = width[k];
                }
            }
            sum += min;
        }

        return sum;
    }

    public long methodCall() {
        long sum = 0;

        for (int i = 0; i < ITERATIONS; i++) {
            int min = getMin();
            sum += min;
        }

        return sum;
    }

    private int getMin() {
        int min = width[0];
        for (int k = 1; k < ARRAY_SIZE; k++) {
            if (min > width[k]) {
                min = width[k];
            }
        }
        return min;
    }

    public static void main(String[] args) {
        long startTime = System.nanoTime();
        long sum = new NestedLoop().inline();  // or .methodCall();
        long endTime = System.nanoTime();

        long ms = (endTime - startTime) / 1000000;
        System.out.println("sum = " + sum + ", time = " + ms + " ms");
    }
}

The inline variant indeed runs 3-4 times slower than methodCall.


I've used the following JVM options to confirm that both benchmarks are compiled on the highest tier and OSR (on-stack replacement) successfully occurs in both cases.

-XX:-TieredCompilation
-XX:CompileOnly=NestedLoop
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintCompilation
-XX:+TraceNMethodInstalls
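
For reference, the full command line looks roughly like this (NestedLoop is the benchmark class above; the flags are exactly the ones listed):

java -XX:-TieredCompilation -XX:CompileOnly=NestedLoop -XX:+UnlockDiagnosticVMOptions -XX:+PrintCompilation -XX:+TraceNMethodInstalls NestedLoop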

'inline' compilation log:

    251   46 %           NestedLoop::inline @ 21 (70 bytes)
Installing osr method (4) NestedLoop.inline()J @ 21

'methodCall' compilation log:

    271   46             NestedLoop::getMin (41 bytes)
Installing method (4) NestedLoop.getMin()I 
    274   47 %           NestedLoop::getMin @ 9 (41 bytes)
Installing osr method (4) NestedLoop.getMin()I @ 9
    314   48 %           NestedLoop::methodCall @ 4 (30 bytes)
Installing osr method (4) NestedLoop.methodCall()J @ 4

This means JIT does its job, but the generated code must be different.
Let's analyze it with -XX:+PrintAssembly.
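
A note in case you want to reproduce this: -XX:+PrintAssembly is a diagnostic option, so it has to be combined with -XX:+UnlockDiagnosticVMOptions, and it only produces output when the hsdis disassembler plugin is available on the JVM's library path. A typical invocation (same class as above):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -XX:CompileOnly=NestedLoop NestedLoop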


'inline' disassembly (the hottest fragment)

0x0000000002df4dd0: inc    %ebp               ; OopMap{r11=Derived_oop_rbx rbx=Oop off=114}
                                              ;*goto
                                              ; - NestedLoop::inline@53 (line 12)

0x0000000002df4dd2: test   %eax,-0x1d64dd8(%rip)        # 0x0000000001090000
                                              ;*iload
                                              ; - NestedLoop::inline@21 (line 12)
                                              ;   {poll}
0x0000000002df4dd8: cmp    $0x1388,%ebp
0x0000000002df4dde: jge    0x0000000002df4dfd  ;*if_icmpge
                                              ; - NestedLoop::inline@26 (line 12)

0x0000000002df4de0: test   %rbx,%rbx
0x0000000002df4de3: je     0x0000000002df4e4c
0x0000000002df4de5: mov    (%r11),%r10d       ;*getfield width
                                              ; - NestedLoop::inline@32 (line 13)

0x0000000002df4de8: mov    0xc(%r10),%r9d     ; implicit exception
0x0000000002df4dec: cmp    %r9d,%ebp
0x0000000002df4def: jae    0x0000000002df4e59
0x0000000002df4df1: mov    0x10(%r10,%rbp,4),%r8d  ;*iaload
                                              ; - NestedLoop::inline@37 (line 13)

0x0000000002df4df6: cmp    %r8d,%r13d
0x0000000002df4df9: jg     0x0000000002df4dc6  ;*if_icmple
                                              ; - NestedLoop::inline@38 (line 13)

0x0000000002df4dfb: jmp    0x0000000002df4dd0

'methodCall' disassembly (also the hottest part)

0x0000000002da2af0: add    $0x8,%edx          ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2af3: cmp    $0x1381,%edx
0x0000000002da2af9: jge    0x0000000002da2b70  ;*iload_1
                                              ; - NestedLoop::getMin@16 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2afb: mov    0x10(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b00: cmp    %r11d,%ecx
0x0000000002da2b03: jg     0x0000000002da2b6b  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b05: mov    0x14(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b0a: cmp    %r11d,%ecx
0x0000000002da2b0d: jg     0x0000000002da2b5c  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b0f: mov    0x18(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b14: cmp    %r11d,%ecx
0x0000000002da2b17: jg     0x0000000002da2b4d  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b19: mov    0x1c(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b1e: cmp    %r11d,%ecx
0x0000000002da2b21: jg     0x0000000002da2b66  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b23: mov    0x20(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b28: cmp    %r11d,%ecx
0x0000000002da2b2b: jg     0x0000000002da2b61  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b2d: mov    0x24(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b32: cmp    %r11d,%ecx
0x0000000002da2b35: jg     0x0000000002da2b52  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b37: mov    0x28(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b3c: cmp    %r11d,%ecx
0x0000000002da2b3f: jg     0x0000000002da2b57  ;*iinc
                                              ; - NestedLoop::getMin@33 (line 36)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b41: mov    0x2c(%r9,%rdx,4),%r11d  ;*iaload
                                              ; - NestedLoop::getMin@22 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b46: cmp    %r11d,%ecx
0x0000000002da2b49: jg     0x0000000002da2ae6  ;*if_icmple
                                              ; - NestedLoop::getMin@23 (line 37)
                                              ; - NestedLoop::methodCall@11 (line 27)

0x0000000002da2b4b: jmp    0x0000000002da2af0

The compiled code is completely different; methodCall is optimized much better (a rough source-level sketch of the difference follows the two lists below).

  • the loop has 8 iterations unrolled;
  • there is no array bounds check inside;
  • the width field is cached in a register.

In contrast, the inline variant

  • does not do loop unrolling;
  • loads the width array from memory every time;
  • performs an array bounds check on each iteration.
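
To make this concrete, here is a rough source-level picture of what the optimized methodCall loop corresponds to (illustrative sketch only: the method name is made up, and the JIT performs these transformations on the machine code, not on Java source):

private int getMinUnrolledSketch() {
    int[] w = width;                        // field read once and kept in a register
    int min = w[0];
    int k = 1;
    // main loop: 8 elements per iteration, no per-element bounds checks
    for (; k + 8 <= ARRAY_SIZE; k += 8) {
        if (w[k]     < min) min = w[k];
        if (w[k + 1] < min) min = w[k + 1];
        if (w[k + 2] < min) min = w[k + 2];
        if (w[k + 3] < min) min = w[k + 3];
        if (w[k + 4] < min) min = w[k + 4];
        if (w[k + 5] < min) min = w[k + 5];
        if (w[k + 6] < min) min = w[k + 6];
        if (w[k + 7] < min) min = w[k + 7];
    }
    // tail loop for the remaining elements
    for (; k < ARRAY_SIZE; k++) {
        if (w[k] < min) min = w[k];
    }
    return min;
}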

OSR-compiled methods are not always optimized very well, because they have to maintain the state of an interpreted stack frame at the transition point. Here is another example of the same problem.

On-stack replacement usually occurs on backward branches (i.e. at the bottom of the loop). The inline method has two nested loops, and OSR happens inside the inner loop, while methodCall has just one outer loop. An OSR transition in the outer loop is more favourable, because the JIT compiler has more freedom to optimize the inner loop. And this is exactly what happens in your case.
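
As the comment exchange below confirms, the penalty is tied to the OSR-compiled version: once inline() has been invoked a few more times, HotSpot compiles the whole method the regular way and it becomes as fast as methodCall. A minimal warm-up driver to observe this (a sketch; the run count of 5 is arbitrary):

public static void main(String[] args) {
    NestedLoop benchmark = new NestedLoop();
    for (int run = 0; run < 5; run++) {
        long startTime = System.nanoTime();
        // the first run(s) use the OSR-compiled code; once inline() itself
        // gets a regular compilation, the later runs match methodCall's speed
        long sum = benchmark.inline();
        long ms = (System.nanoTime() - startTime) / 1000000;
        System.out.println("run " + run + ": sum = " + sum + ", time = " + ms + " ms");
    }
}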

apangin
  • 92,924
  • 10
  • 193
  • 247
  • Is this really an OSR-only problem? What if you'd call `inline()` a second time? I guess, then a properly compiled method can be used, but it might still be slower (e.g., due to less unrolling in the longer method). Am I right? – maaartinus Mar 04 '17 at 12:58
  • 1
    @maaartinus Yes, after calling `inline` multiple times it becomes exactly as fast as `methodCall`. – apangin Mar 04 '17 at 14:57
4

Without doing an actual analysis, getMin is most likely being JIT-compiled because you extracted it into a method that is called many times. If you are using the HotSpot JVM, this happens by default after 10,000 method executions.

You can always inspect the final code used by your application with the right flags and JVM builds. Check out the question/answer How to see JIT-compiled code in JVM for an example of how.
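
For instance, running the original program with -XX:+PrintCompilation (a standard HotSpot flag) prints a line for every method the JIT compiles, so you can see whether and when getMin gets compiled; adding the diagnostic -XX:+PrintInlining additionally shows inlining decisions (YourMainClass is a placeholder for the actual main class):

java -XX:+PrintCompilation -XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining YourMainClass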

NESPowerGlove
  • 5,496
  • 17
  • 28
  • 1
    Sorry for downvoting, but the answer is misleading. It does not explain why the first piece of code is slower. HotSpot JVM can easily compile methods with loops even if they don't contain other method calls. The information about 10,000 invocations is also misleading; HotSpot in JDK 8+ starts compiling much earlier. See [this answer](http://stackoverflow.com/questions/35601841/how-does-the-jvm-decided-to-jit-compile-a-method-categorize-a-method-as-hot/35614237#35614237) for details. – apangin Mar 03 '17 at 08:12
  • @apangin As far as I know, the HotSpot JVM JIT compiles methods, not code blocks, so if the original piece of code contained a loop in a method called just once, then the lack of JIT compilation would explain why the first piece of code is slower. Perhaps my info on that is now outdated, but I also don't see any reason for the difference off the bat, which is why I believe a JIT difference is the most likely reason. – NESPowerGlove Mar 03 '17 at 16:44
  • If there is a hot loop inside a method, the whole method will be compiled and OSR-replaced. The compilation indeed takes place in both cases. However, the compiled code for the first case is less optimal than for the second case. The answer "because of JIT compilation" is kinda obvious, but does not seem to explain anything. – apangin Mar 03 '17 at 17:01
1

One advantage of Java over languages compiled ahead of time, such as C++, is that the JIT (just-in-time compiler) can optimize the bytecode while the code is executing. In addition, the Java compiler itself already performs several optimizations at build time. These techniques make it possible, for example, to turn a method call into inline code within a loop, avoiding the repeated method-lookup overhead of polymorphic calls. Running a method inline means the method's code executes as if it had been written directly at the call site, so there is no overhead for method lookup, memory allocation, or new context variables. Basically, inside your for loop some processing is lost when new variables (int k, for example) are allocated; when you move that loop into a method, you reduce the overhead because the variables are already allocated for that execution.
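
For illustration only, this is what inlining means at the source level, using the getMin method from the question (the JIT applies this transformation to the compiled code, not to your source; the method names here are made up):

// What you write: the loop body calls a method.
int findMinViaCall(int i, int j) {
    return getMin(i, j);
}

// What the JIT effectively executes after inlining: the body of getMin is
// substituted at the call site, so no call overhead remains.
int findMinInlined(int i, int j) {
    int min = width[i];
    for (int k = i + 1; k <= j; k++) {
        if (width[k] < min) {
            min = width[k];
        }
    }
    return min;
}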

brow-joe
  • 171
  • 1
  • 9
1

The question does not provide a reproducible test case, so I built one that focuses solely on computing the range minima:

git clone git@github.com:lemire/microbenchmarks.git
cd microbenchmarks
mvn clean install
java -cp target/microbenchmarks-0.0.1-jar-with-dependencies.jar me.lemire.microbenchmarks.rangequery.RangeMinimum

My results are (on a server configured for testing, Java 8):

m.l.m.r.RangeMinimum.embeddedmin    avgt        5  0.053 ± 0.009  ms/op
m.l.m.r.RangeMinimum.fncmin         avgt        5  0.052 ± 0.003  ms/op

So in my test case there is no significant performance difference between having one large loop with a subloop and having one loop that contains a function call. Note that the benchmark calls the functions multiple times so that the JIT compiler can do its work.
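
For reference, the two benchmark methods are roughly of this shape (a sketch of a JMH benchmark with the two variants; the class name, array size and query ranges are my own choices, not necessarily those of the repository):

import java.util.Random;
import java.util.concurrent.TimeUnit;
import org.openjdk.jmh.annotations.*;

@State(Scope.Benchmark)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MILLISECONDS)
public class RangeMinimumSketch {
    int[] width = new Random(0).ints(5000).toArray();

    // scan written directly inside the benchmark method
    @Benchmark
    public long embeddedmin() {
        long sum = 0;
        for (int start = 0; start + 16 <= width.length; start++) {
            int min = width[start];
            for (int k = start + 1; k < start + 16; k++) {
                if (width[k] < min) min = width[k];
            }
            sum += min;
        }
        return sum;
    }

    // scan delegated to a separate method
    @Benchmark
    public long fncmin() {
        long sum = 0;
        for (int start = 0; start + 16 <= width.length; start++) {
            sum += getMin(start, start + 16);
        }
        return sum;
    }

    private int getMin(int i, int j) {
        int min = width[i];
        for (int k = i + 1; k < j; k++) {
            if (width[k] < min) min = width[k];
        }
        return min;
    }
}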

Daniel Lemire
  • 3,470
  • 2
  • 25
  • 23
-2

I believe Java is doing some optimization/memoization. It could cache the results of functions if the functions/methods are pure. I believe your time has decreased but your space/memory will increase (due to memoization), and vice versa.

hhafeez
  • 73
  • 9