5
int steps = 256 * 1024 * 1024;
int[] a = new int[2];

// Loop 1
for (int i=0; i<steps; i++) { a[0]++; a[0]++; }

// Loop 2
for (int i=0; i<steps; i++) { a[0]++; a[1]++; }

Can someone explain why the second loop is roughly 12 times slower than the first (19 ms vs 232 ms)?

This is how I'm timing it:

long start_time = System.currentTimeMillis();

// Loop

long end_time = System.currentTimeMillis();
System.out.println(end_time - start_time);
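
For completeness, a self-contained version of what I'm running looks roughly like this (the class name and output labels are just placeholders):

public class LoopTest {
    public static void main(String[] args) {
        int steps = 256 * 1024 * 1024;
        int[] a = new int[2];

        long startTime = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[0]++; }   // Loop 1
        long endTime = System.currentTimeMillis();
        System.out.println("Loop 1: " + (endTime - startTime) + " ms");

        startTime = System.currentTimeMillis();
        for (int i = 0; i < steps; i++) { a[0]++; a[1]++; }   // Loop 2
        endTime = System.currentTimeMillis();
        System.out.println("Loop 2: " + (endTime - startTime) + " ms");
    }
}
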
Boann
  • 2
    Look at the bytecode; the first one probably got optimized out completely. – Sopel Jul 21 '17 at 19:28
  • 3
    It would help if you provided your timing code in a full MVCE. – merlin2011 Jul 21 '17 at 19:28
  • 1
    Actually, I just timed it (System.currentTimeMillis() before, between, and after each loop), and got roughly what they did. – 17slim Jul 21 '17 at 19:31
  • 1
    How many times did you time it? You should do at least 10.000 iterations to be sure you have an accurate number – Daniel Jul 21 '17 at 19:31
  • 1
    Timed it like 20 times, got almost the exact same for both timings each iteration – 17slim Jul 21 '17 at 19:32
  • Although I understand the concern now... Iterating tests is quite different than running several times. The JVM has to be warmed up first I believe, so iterating the test multiple times would give it ample time to normalize. – 17slim Jul 21 '17 at 19:39
  • Iterated it 200 times and both loops were in the 30-40 ms range. Strange how after a bit of warmup the second loop would be slower, however. – 17slim Jul 21 '17 at 19:39
  • Making the loops identical yields the same results. I would almost certainly chalk this up to being a result of the JVM warming up. see: https://stackoverflow.com/questions/36198278/why-does-the-jvm-require-warmup – 17slim Jul 21 '17 at 19:43
  • 1
    You should really mention the platform and JVM in java performance question, it's obviously going to vary. – harold Jul 21 '17 at 19:55
  • Such stuff is hard to measure and analyse correctly; there are optimizations everywhere. Even your CPU will optimize this by loading complete cache lines up the cache hierarchy. When accessing a[0], the CPU will load a[1] and many of the following elements into the cache (or move them up in the hierarchy), so subsequent accesses will be faster. – Zabuzard Jul 21 '17 at 20:24
  • On my system, the first loop isn't a loop (it takes constant time, independent of `steps`) while the second loop *is* a loop and scales with `steps`. – harold Jul 21 '17 at 20:32
  • @Sopel I doubt you'll find anything in the bytecode. Optimization generally happens at runtime. – shmosel Jul 21 '17 at 21:06

1 Answer

9

Summary

The JIT compiler is able to convert the first loop into a single multiply, but it does not optimize the second loop very much.

Discussion

The bytecode for both loops is basically the same (you can view this with javap -c test.class).

In Java, the bytecode is converted into x86 instructions by the JIT compiler, which can perform additional optimizations.

You can actually view the assembly produced by the JIT via java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly ... if you have the hsdis plugin.

I changed the value you add to each element to 0xbad to make it easier to spot the relevant code and changed the loop counter to long.
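
In source form, the modified loops look roughly like this (my reconstruction of the change described above; the surrounding harness is unchanged):

// Loop 1, with a recognisable constant and a long counter
for (long i = 0; i < steps; i++) { a[0] += 0xbad; a[0] += 0xbad; }

// Loop 2, with the same change
for (long i = 0; i < steps; i++) { a[0] += 0xbad; a[1] += 0xbad; }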

The first loop produces:

  mov     r11d,dword ptr [r13+10h]    ; load a[0] from memory
  ...
  add     r11d,175ah                  ; add 2 * 0xbad to the value
  mov     dword ptr [r13+10h],r11d    ; store the result back to a[0]

The second loop produces:

   mov     ebx,dword ptr [rax+10h]    ; load a[0] from memory
   add     ebx,0badh                  ; add 0xbad
   ...
   mov     dword ptr [rax+10h],ebx    ; store the result back to a[0]
   ...
   mov     ebx,dword ptr [rax+14h]    ; load a[1] from memory
   add     ebx,0badh                  ; add 0xbad
   ...
   mov     dword ptr [rax+14h],ebx    ; store the result back to a[1]

So you can see that the compiler is already able to optimize the first loop into fewer instructions.

In particular, it has spotted that the two additions to the same array element can be coalesced into a single addition of twice the value.
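
In Java terms, the effect on the modified Loop 1 is roughly as if the body had been rewritten like this (an illustration of the transformation, not anything the JIT literally emits):

// Two additions to a[0] coalesced into one add of 0x175a per iteration (illustrative)
for (long i = 0; i < steps; i++) {
    a[0] += 2 * 0xbad;
}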

When I changed the loop counter back to int, I noticed that the compiler manages to do even better with your first loop:

mov     r10d,dword ptr [r14+10h]    ; load a[0] from memory
imul    ecx,r13d,175ah              ; converts lots of adds of 0xbad into a single multiply
mov     r11d,r10d
sub     r11d,ecx
add     r10d,175ah
mov     dword ptr [r14+10h],r10d    ; store the result back to a[0]

In this case it has spotted that it can actually implement several iterations of your loop in a single pass by using a multiplication! This explains how the first loop can be an order of magnitude faster than the second.
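
In Java terms, the net effect is roughly the following (illustrative; the generated code uses the multiply to process batches of iterations rather than literally one statement, but because int arithmetic wraps modulo 2^32 the final value is the same as performing every addition individually):

// A whole run of Loop 1 iterations collapsed into one multiply-and-add (illustrative)
a[0] += steps * 2 * 0xbad;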

Peter de Rivaz
  • 1
    Can you explain the conclusion further? What exactly is the difference, and why is it relevant for the speed? I think that would help the OP even more than just showing him things he has possibly never seen before. – Zabuzard Jul 21 '17 at 21:06