115

I'm running the following Java code on a laptop with a 2.7 GHz Intel Core i7. I intended to measure how long it takes to finish a loop with 2^32 iterations, which I expected to be roughly 1.48 seconds (about 4 × 10^9 iterations / 2.7 × 10^9 cycles per second, assuming one iteration per cycle).

But actually it only takes 2 milliseconds, instead of 1.48 s. I'm wondering if this is a result of any JVM optimization underneath?

public static void main(String[] args)
{
    long start = System.nanoTime();

    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }
    long finish = System.nanoTime();
    long d = (finish - start) / 1000000;

    System.out.println("Used " + d);
}
Peter Mortensen
twimo
  • 69
    Well yes. Because the loop body has no side-effects, the compiler quite happily eliminates it. Examine the byte-code with `javap -v` to see. – Elliott Frisch Dec 24 '17 at 01:13
  • @ElliottFrisch can you explain a little more? – Shane Sepac Dec 24 '17 at 01:15
  • 36
    You won't see that back in the byte-code. `javac` does very little actual optimization and leaves most of it to the JIT compiler. – Jorn Vernee Dec 24 '17 at 01:22
  • In Eclipse if you run without any breakpoints it optimizes away the loop. If you set a breakpoint anywhere it executes the loop. – Jim Garrison Dec 24 '17 at 01:34
  • @zzhang Intel processors also optimize: repeated instructions are cached and hence do not require recalculation of the logic; only the data part that gets calculated is not fully cached (but is guessed accurately). This might also have to do with the extreme efficiency. – Rahul Dec 24 '17 at 08:45
  • 4
    *'I'm wondering if this is a result of any JVM optimization underneath?'* - What do you think? What else could it be if not a JVM optimization? – apangin Dec 24 '17 at 14:38
  • 7
    The answer to this question is basically contained in https://stackoverflow.com/a/25323548/3182664 . It also contains the resulting assembly (machine code) that the JIT generates for such cases, showing that **the loop is completely optimized away by the JIT**. (The question at https://stackoverflow.com/q/25326377/3182664 shows that it might take a bit longer if the loop does not do 4 billion operations, but 4 billion **minus one** ;-) ). I'd nearly consider this question as a duplicate of the other - any objections? – Marco13 Dec 24 '17 at 14:38
  • 1
    @ElliottFrisch Seriously, with close to 134 _thousand_ reputation points you should have seen the message printed when writing a comment: _Avoid answering questions in comments._ It's there for a reason. – pipe Dec 24 '17 at 16:39
  • Here's some background on compiler optimisations and how they use the real processor rather than the virtual x86 that we program to. https://blogs.msdn.microsoft.com/oldnewthing/20041216-00/?p=36973 and https://blogs.msdn.microsoft.com/oldnewthing/20140627-00/?p=633 – ACatInLove Dec 24 '17 at 21:34
  • 7
    You assume the processor will perform one iteration per Hz. That is a far-reaching assumption. Processors today perform all sorts of optimizations, as @Rahul mentioned, and unless you know much more about how the Core i7 works, you cannot assume that. – Tsahi Asher Dec 25 '17 at 11:03
  • On top of what Tsahi says, for each loop the processor has to, at least: increase counter, check, jump if. There's no way that's happening in one cycle. – Martin Argerami Dec 25 '17 at 15:46
  • Java is actually as slow as taking 2ms to finish an optimized out no-effect loop expression. – hkBattousai Dec 26 '17 at 07:00
  • @hkBattousai That's a good point, but I'm skeptical - the resolution of the timer is dependent on the platform, and the measurements themselves are expensive, *especially* nanotime. [More information](https://web.archive.org/web/20160308031939/https://blogs.oracle.com/dholmes/entry/inside_the_hotspot_vm_clocks) found via [another SO question](https://stackoverflow.com/questions/351565/system-currenttimemillis-vs-system-nanotime#351571). To be clear, nanotime is the right tool for the job ("monotonically increasing elapsed time"), just keep the 'observer effect' in the back of your mind. – John P Dec 27 '17 at 00:45
  • 1
    @hkBattousai: it’s a one-time execution within the `main` method. So it starts unoptimized, interpreted, before at some time, the optimizer kicks in. Such a scenario also depends on the presence of on-stack-replacement, to make the optimized version effective. That’s why real microbenchmarks work entirely different. – Holger Dec 28 '17 at 12:13

6 Answers

107

One of two things is going on here:

  1. The compiler realized that the loop is redundant and doing nothing so it optimized it away.

  2. The JIT (just-in-time compiler) realized that the loop is redundant and doing nothing, so it optimized it away.

Modern compilers are very intelligent; they can see when code is useless. Try putting an empty loop into Godbolt and looking at the output, then turn on `-O2` optimizations; you will see that the output is something along the lines of

main():
    xor eax, eax
    ret

I would like to clarify something: in Java, most of the optimizations are done by the JIT. In some other languages (like C/C++), most of the optimizations are done by the first compiler.
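As a quick, hypothetical sketch (the class and method names here are mine, not from the original post): once the loop's result actually escapes, the JIT can no longer discard the loop as dead code outright, although it may still reduce a simple counting loop to a closed-form value:

```java
public class LoopEscape {
    // Counting loop whose result escapes via the return value; unlike the
    // question's empty loop, this cannot be eliminated as dead code
    // (though the JIT may still strength-reduce it to a constant).
    static long countIterations(int limit) {
        long result = 0;
        for (int i = 0; i < limit; i++) {
            ++result;
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(countIterations(1_000_000)); // prints 1000000
    }
}
```

Timing `countIterations` with `System.nanoTime()` should therefore give more plausible numbers than timing the empty loop, though a proper harness such as JMH is the right tool for real measurements.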

vandench
  • Is the compiler permitted to do such optimisations? I don't know for sure for Java, but .NET compilers should generally avoid this to allow the JIT do the best optimisations for the platform. – IS4 Dec 24 '17 at 10:39
  • 1
    @IllidanS4 In general, this depends on the language standard. If the compiler can perform optimisations that mean the code, interpreted by the standard, has the same effect, then yes. There are many subtleties one has to consider though, e.g. there are some transformations for floating point computations that can result in the possibility of over/underflow being introduced, so any optimisation has to be carefully done. – user1997744 Dec 24 '17 at 11:56
  • 9
    @IllidanS4 how should the runtime environment be able to do better optimization? At the very least it must analyze the code which cannot be faster than removing the code during compilation. – Gerhardh Dec 24 '17 at 12:20
  • @Gerhardh: That's really unrelated to this question, but from the top of my head: 1) The runtime knows the processor it is running on 2) The runtime can make use of execution statistics to guide speculative optimization (e.g. inline calls to virtual functions that happen to be monomorphic during a particular execution). – meriton Dec 24 '17 at 14:05
  • 2
    @Gerhardh I wasn't talking about this precise case, when the runtime cannot do a better job at removing redundant parts of code, but there may of course be some cases when this reason is correct. And because there can be other compilers for the JRE from other languages, the runtime *should* also do these optimisations, so there is potentially no reason for them to be done both by the runtime and the compiler. – IS4 Dec 24 '17 at 14:06
  • 6
    @IllidanS4 any runtime optimization cannot take less than zero time. Preventing the compiler from removing the code would not make any sense. – Gerhardh Dec 24 '17 at 14:28
  • @meriton the question is about an empty loop. This can easily be detected by the compiler without knowing CPU etc.. For any code having some effect the RE might be able to do a better job but that's another question. – Gerhardh Dec 24 '17 at 14:30
  • It's number 2. The compiler (`javac`) does only very few optimizations, and one can have a look at the bytecode to see that the loop is still there. However, the JIT will remove the loop, as it can be seen in the resulting machine code. More details at https://stackoverflow.com/a/25323548/3182664 – Marco13 Dec 24 '17 at 14:41
  • @IllidanS4: `javac` doesn’t do this, but generally, the Java Language Specification makes no difference between the compilers, but just defines legal executions. So if this optimization is legal for the JIT, it’s also legal for `javac`, but, as said, `javac` does very little in this regard. – Holger Dec 28 '17 at 12:05
56

It looks like it was optimized away by the JIT compiler. When I turn it off (`-Djava.compiler=NONE`), the code runs much slower:

$ javac MyClass.java
$ java MyClass
Used 4
$ java -Djava.compiler=NONE MyClass
Used 40409

I put the OP's code inside a class named `MyClass`.

Akavall
  • 2
    Weird. When I run the code both ways, it _is_ faster without the flag, but only by a factor of 10, and adding or removing zeros to the number of iterations in the loop also affects the running time by factors of ten, with and without the flag. So (for me) the loop seems not to be optimized away entirely, just made 10 times faster, somehow. (Oracle Java 8-151) – tobias_k Dec 24 '17 at 14:08
  • @tobias_k it depends on what stage of the JIT the loop is going through, I guess: https://stackoverflow.com/a/47972226/1059372 – Eugene Dec 25 '17 at 21:49
21

I'll just state the obvious: this is a JVM optimization; the loop is simply removed altogether. Here is a small test that shows what a huge difference the JIT makes when it is fully enabled, enabled only up to the C1 compiler, and disabled entirely.

Disclaimer: don't write tests like this - this is just to prove that the actual loop "removal" happens in the C2 compiler:

@Benchmark
@Fork(1)
public void full() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        ++result;
    }
}

@Benchmark
@Fork(1)
public void minusOne() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-XX:TieredStopAtLevel=1" })
public void withoutC2() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

@Benchmark
@Fork(value = 1, jvmArgsAppend = { "-Xint" })
public void withoutAll() {
    long result = 0;
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE - 1; i++) {
        ++result;
    }
}

The results show that depending on which part of the JIT is enabled, the method gets faster (so much faster that it looks like it's doing "nothing"). The loop removal appears to happen in the C2 compiler, which is the highest tier:

 Benchmark        Mode  Cnt      Score  Error  Units
 Loop.full        avgt    2     ≈ 10⁻⁷         ms/op
 Loop.minusOne    avgt    2     ≈ 10⁻⁶         ms/op
 Loop.withoutAll  avgt    2  51782.751         ms/op
 Loop.withoutC2   avgt    2   1699.137         ms/op
Eugene
13

As already pointed out, the JIT (just-in-time) compiler can optimize away an empty loop in order to remove unnecessary iterations. But how?

Actually, there are two JIT compilers: C1 and C2. First, the code is compiled with C1. C1 collects statistics and helps the JVM discover that in 100% of cases our empty loop doesn't change anything and is useless. At this point C2 enters the stage: when the code gets called very often, it can be optimized and compiled with C2 using the collected statistics.

As an example, I will test the following code snippet (my JDK is a slowdebug build of 9-internal):

public class Demo {
    public static void main(String[] args) {
        run(); // entry point, so that run() is actually invoked
    }

    private static void run() {
        for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
        }
        System.out.println("Done!");
    }
}

With the following command line options:

-XX:+UnlockDiagnosticVMOptions -XX:CompileCommand=print,*Demo.run

And there are different versions of my run method, compiled with C1 and C2 respectively. For me, the final (C2) variant looks something like this:

...

; B1: # B3 B2 <- BLOCK HEAD IS JUNK  Freq: 1
0x00000000125461b0: mov   dword ptr [rsp+0ffffffffffff7000h], eax
0x00000000125461b7: push  rbp
0x00000000125461b8: sub   rsp, 40h
0x00000000125461bc: mov   ebp, dword ptr [rdx]
0x00000000125461be: mov   rcx, rdx
0x00000000125461c1: mov   r10, 57fbc220h
0x00000000125461cb: call  indirect r10    ; *iload_1

0x00000000125461ce: cmp   ebp, 7fffffffh  ; 7fffffff => 2147483647
0x00000000125461d4: jnl   125461dbh       ; jump if not less

; B2: # B3 <- B1  Freq: 0.999999
0x00000000125461d6: mov   ebp, 7fffffffh  ; *if_icmpge

; B3: # N44 <- B1 B2  Freq: 1       
0x00000000125461db: mov   edx, 0ffffff5dh
0x0000000012837d60: nop
0x0000000012837d61: nop
0x0000000012837d62: nop
0x0000000012837d63: call  0ae86fa0h

...

It is a little bit messy, but if you look closely, you may notice that there is no long-running loop here. There are 3 blocks: B1, B2 and B3, and the execution steps can be B1 -> B2 -> B3 or B1 -> B3. Freq: 1 is the normalized estimated frequency of a block's execution.

Oleksandr Pyrohov
8

You are measuring the time it takes to detect that the loop doesn't do anything, compile the code in a background thread, and eliminate the code.

for (int t = 0; t < 5; t++) {
    long start = System.nanoTime();
    for (int i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }
    long time = System.nanoTime() - start;

    String s = String.format("%d: Took %.6f ms", t, time / 1e6);
    Thread.sleep(50);
    System.out.println(s);
    Thread.sleep(50);
}

If you run this with -XX:+PrintCompilation you can see the code has been compiled in the background to level 3, the C1 compiler, and after a few loops to level 4, the C2 compiler.

    129   34 %     3       A::main @ 15 (93 bytes)
    130   35       3       A::main (93 bytes)
    130   36 %     4       A::main @ 15 (93 bytes)
    131   34 %     3       A::main @ -2 (93 bytes)   made not entrant
    131   36 %     4       A::main @ -2 (93 bytes)   made not entrant
0: Took 2.510408 ms
    268   75 %     3       A::main @ 15 (93 bytes)
    271   76 %     4       A::main @ 15 (93 bytes)
    274   75 %     3       A::main @ -2 (93 bytes)   made not entrant
1: Took 5.629456 ms
2: Took 0.000000 ms
3: Took 0.000364 ms
4: Took 0.000365 ms

If you change the loop to use a `long` counter, it doesn't get optimised in the same way.

    for (long i = Integer.MIN_VALUE; i < Integer.MAX_VALUE; i++) {
    }

instead you get

0: Took 1579.267321 ms
1: Took 1674.148662 ms
2: Took 1885.692166 ms
3: Took 1709.870567 ms
4: Took 1754.005112 ms
Peter Lawrey
  • That's strange... why would a `long` counter prevent the same optimization from happening? – Ryan Amos Jan 15 '18 at 16:06
  • @RyanAmos the optimisation is only applied to the common primitive loop counter of type `int`; note `char` and `short` are effectively the same at the byte-code level. – Peter Lawrey Jan 15 '18 at 17:36
-2

You take the start and finish times in nanoseconds and divide by 10^6 to calculate the latency:

long d = (finish - start) / 1000000

It should be 10^9, because 1 second = 10^9 nanoseconds.
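Whichever unit is intended, `java.util.concurrent.TimeUnit` makes the conversion explicit (a minimal sketch; the measured value is made up):

```java
import java.util.concurrent.TimeUnit;

public class Units {
    public static void main(String[] args) {
        long elapsedNanos = 1_480_000_000L; // a made-up measurement: 1.48 s

        // Dividing by 10^6, as in the question, yields milliseconds;
        // dividing by 10^9 yields seconds.
        System.out.println(TimeUnit.NANOSECONDS.toMillis(elapsedNanos));  // prints 1480
        System.out.println(TimeUnit.NANOSECONDS.toSeconds(elapsedNanos)); // prints 1
    }
}
```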

DHARMENDRA SINGH
  • What you suggested is irrelevant to my point. What I was wondering about is how long it took, and it doesn't matter whether this duration is printed/represented in terms of milli-second or second. – twimo Jan 12 '18 at 17:31