TL;DR JIT compiler has more opportunities to optimize the inner loop in the second case, because on-stack replacement happens at the different point.
I've managed to reproduce the problem with the reduced test case.
No I/O or string operations involved, just two nested loops with array access.
public class NestedLoop {
private static final int ARRAY_SIZE = 5000;
private static final int ITERATIONS = 1000000;
private int[] width = new java.util.Random(0).ints(ARRAY_SIZE).toArray();
public long inline() {
long sum = 0;
for (int i = 0; i < ITERATIONS; i++) {
int min = width[0];
for (int k = 1; k < ARRAY_SIZE; k++) {
if (min > width[k]) {
min = width[k];
}
}
sum += min;
}
return sum;
}
public long methodCall() {
long sum = 0;
for (int i = 0; i < ITERATIONS; i++) {
int min = getMin();
sum += min;
}
return sum;
}
private int getMin() {
int min = width[0];
for (int k = 1; k < ARRAY_SIZE; k++) {
if (min > width[k]) {
min = width[k];
}
}
return min;
}
public static void main(String[] args) {
long startTime = System.nanoTime();
long sum = new NestedLoop().inline(); // or .methodCall();
long endTime = System.nanoTime();
long ms = (endTime - startTime) / 1000000;
System.out.println("sum = " + sum + ", time = " + ms + " ms");
}
}
inline
variant indeed works 3-4 times slower than methodCall
.
I've used the following JVM options to confirm that both benchmarks are compiled on the highest tier and OSR (on-stack replacement) successfully occurs in both cases.
-XX:-TieredCompilation
-XX:CompileOnly=NestedLoop
-XX:+UnlockDiagnosticVMOptions
-XX:+PrintCompilation
-XX:+TraceNMethodInstalls
'inline' compilation log:
251 46 % NestedLoop::inline @ 21 (70 bytes)
Installing osr method (4) NestedLoop.inline()J @ 21
'methodCall' compilation log:
271 46 NestedLoop::getMin (41 bytes)
Installing method (4) NestedLoop.getMin()I
274 47 % NestedLoop::getMin @ 9 (41 bytes)
Installing osr method (4) NestedLoop.getMin()I @ 9
314 48 % NestedLoop::methodCall @ 4 (30 bytes)
Installing osr method (4) NestedLoop.methodCall()J @ 4
This means JIT does its job, but the generated code must be different.
Let's analyze it with -XX:+PrintAssembly
.
'inline' disassembly (the hottest fragment)
0x0000000002df4dd0: inc %ebp ; OopMap{r11=Derived_oop_rbx rbx=Oop off=114}
;*goto
; - NestedLoop::inline@53 (line 12)
0x0000000002df4dd2: test %eax,-0x1d64dd8(%rip) # 0x0000000001090000
;*iload
; - NestedLoop::inline@21 (line 12)
; {poll}
0x0000000002df4dd8: cmp $0x1388,%ebp
0x0000000002df4dde: jge 0x0000000002df4dfd ;*if_icmpge
; - NestedLoop::inline@26 (line 12)
0x0000000002df4de0: test %rbx,%rbx
0x0000000002df4de3: je 0x0000000002df4e4c
0x0000000002df4de5: mov (%r11),%r10d ;*getfield width
; - NestedLoop::inline@32 (line 13)
0x0000000002df4de8: mov 0xc(%r10),%r9d ; implicit exception
0x0000000002df4dec: cmp %r9d,%ebp
0x0000000002df4def: jae 0x0000000002df4e59
0x0000000002df4df1: mov 0x10(%r10,%rbp,4),%r8d ;*iaload
; - NestedLoop::inline@37 (line 13)
0x0000000002df4df6: cmp %r8d,%r13d
0x0000000002df4df9: jg 0x0000000002df4dc6 ;*if_icmple
; - NestedLoop::inline@38 (line 13)
0x0000000002df4dfb: jmp 0x0000000002df4dd0
'methodCall' disassembly (also the hottest part)
0x0000000002da2af0: add $0x8,%edx ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2af3: cmp $0x1381,%edx
0x0000000002da2af9: jge 0x0000000002da2b70 ;*iload_1
; - NestedLoop::getMin@16 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2afb: mov 0x10(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b00: cmp %r11d,%ecx
0x0000000002da2b03: jg 0x0000000002da2b6b ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b05: mov 0x14(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b0a: cmp %r11d,%ecx
0x0000000002da2b0d: jg 0x0000000002da2b5c ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b0f: mov 0x18(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b14: cmp %r11d,%ecx
0x0000000002da2b17: jg 0x0000000002da2b4d ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b19: mov 0x1c(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b1e: cmp %r11d,%ecx
0x0000000002da2b21: jg 0x0000000002da2b66 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b23: mov 0x20(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b28: cmp %r11d,%ecx
0x0000000002da2b2b: jg 0x0000000002da2b61 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b2d: mov 0x24(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b32: cmp %r11d,%ecx
0x0000000002da2b35: jg 0x0000000002da2b52 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b37: mov 0x28(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b3c: cmp %r11d,%ecx
0x0000000002da2b3f: jg 0x0000000002da2b57 ;*iinc
; - NestedLoop::getMin@33 (line 36)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b41: mov 0x2c(%r9,%rdx,4),%r11d ;*iaload
; - NestedLoop::getMin@22 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b46: cmp %r11d,%ecx
0x0000000002da2b49: jg 0x0000000002da2ae6 ;*if_icmple
; - NestedLoop::getMin@23 (line 37)
; - NestedLoop::methodCall@11 (line 27)
0x0000000002da2b4b: jmp 0x0000000002da2af0
The compiled code is completely different; methodCall
is optimized much better.
- the loop has 8 iterations unrolled;
- there is no array bounds check inside;
width
field is cached in the register.
In contrast, inline
variant
- does not do loop unrolling;
- loads
width
array from memory every time;
- performs array bounds check on each iteration.
OSR-compiled methods are not always optimized very well, because they have to maintain the state of an interpreted stack frame at the transition point. Here is another example of the same problem.
On-stack replacement usually occurs on backward branches (i.e. at the bottom of the loop). inline
method has two nested loops, and OSR happens inside the inner loop, while methodCall
has just one outer loop. OSR transition in the outer loop in more favourable, because JIT compiler has more freedom to optimize the inner loop. And this is what exactly happens in your case.