
I ran some timing tests on Java for loops and noticed some strange behaviour. For my code I need wrapper objects for primitive types like int, double and so on, to simulate input and output parameters, but that's not the point. Just look at my code. How can objects with field access be faster than primitive types?

For loop with primitive type:

public static void main(String[] args) {
    double max = 1000;
    for (int j = 1; j < 8; j++) {
        double i;
        max = max * 10;
        long start = System.nanoTime();
        for (i = 0; i < max; i++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime primitive(max: ="+max + "): " + microseconds);
    }
}

Result:

MicroTime primitive(max: =10000.0): 110
MicroTime primitive(max: =100000.0): 1081
MicroTime primitive(max: =1000000.0): 2450
MicroTime primitive(max: =1.0E7): 28248
MicroTime primitive(max: =1.0E8): 276205
MicroTime primitive(max: =1.0E9): 2729824
MicroTime primitive(max: =1.0E10): 27547009

For loop with wrapper object:

public static void main(String[] args) {
    HDouble max = new HDouble();
    max.value = 1000;
    for (int j = 1; j < 8; j++) {
        HDouble i = new HDouble();
        max.value = max.value*10;
        long start = System.nanoTime();
    for (i.value = 0; i.value < max.value; i.value++) {
        }
        long end = System.nanoTime();
        long microseconds = (end - start) / 1000;
        System.out.println("MicroTime wrapper(max: ="+max.value + "): " + microseconds);
    }
}

Result:

MicroTime wrapper(max: =10000.0): 157
MicroTime wrapper(max: =100000.0): 1561
MicroTime wrapper(max: =1000000.0): 3174
MicroTime wrapper(max: =1.0E7): 15630
MicroTime wrapper(max: =1.0E8): 155471
MicroTime wrapper(max: =1.0E9): 1520967
MicroTime wrapper(max: =1.0E10): 15373311

The more iterations, the faster the second version gets. But why? I know that the Java compiler and the JVM optimize my code, but I never thought that primitive types could be slower than objects with field access.
Does anyone have a plausible explanation for it?

Edit: the HDouble class:

public class HDouble {
    public double value;

    public HDouble() {
    }

    public HDouble(double value) {
        this.value = value;
    }

    @Override
    public String toString() {
        return String.valueOf(value);
    }
}

I also tested my loops with code inside them. For example, I calculated the sum -> same behaviour (the difference is not that big, but I thought the primitive algorithm would have to be much faster?). At first I thought that the calculation takes so long that the field access makes nearly no difference.

Wrapper for-loop:

for (i.value = 0; i.value < max.value; i.value++) {
    sum.value = sum.value + i.value;
}

Result:

MicroTime wrapper(max: =10000.0): 243
MicroTime wrapper(max: =100000.0): 2805
MicroTime wrapper(max: =1000000.0): 3409
MicroTime wrapper(max: =1.0E7): 28104
MicroTime wrapper(max: =1.0E8): 278432
MicroTime wrapper(max: =1.0E9): 2678322
MicroTime wrapper(max: =1.0E10): 26665540

Primitive for-loop:

for (i = 0; i < max; i++) {
    sum = sum + i;
}

Result:

MicroTime primitive(max: =10000.0): 149
MicroTime primitive(max: =100000.0): 1996
MicroTime primitive(max: =1000000.0): 2289
MicroTime primitive(max: =1.0E7): 27085
MicroTime primitive(max: =1.0E8): 279939
MicroTime primitive(max: =1.0E9): 2759133
MicroTime primitive(max: =1.0E10): 27369724

Tom
user3490546
  • I suggest putting some live code inside the loop. Empty loops are rare in real programs, or even in serious benchmarks, so optimization for them may not be very consistent or well tested. – Patricia Shanahan Nov 21 '15 at 10:20
  • Can you show the code of `HDouble`? – M A Nov 21 '15 at 10:24
  • I edited my post and added the code of HDouble ;) – user3490546 Nov 21 '15 at 10:35
  • On my machine using jdk1.8.0_45_x84 the second loop is about 3 times slower than the first one. – tibtof Nov 21 '15 at 10:56
  • The problem with tests like that is that they aren't very accurate. You're measuring some kind of jvm performance but not necessarily how the code compares once optimized since that doesn't happen until runtime. http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java Also system load between tests can have an impact etc so you'll need to repeat tests. The longer running tests in the end could have a meaning. Though the JVM rewrites code so much that it's hard to tell what it does and why. – zapl Nov 21 '15 at 10:59
  • Hmm, and how can I make an accurate loop test? @tibtof: I am using JDK 1.7, maybe I should update it – user3490546 Nov 21 '15 at 11:03
  • I don't think you're interested in how well the jvm can optimize certain loops. It looks more like you want to know whether `HDouble` is feasible (it is) or has a performance impact. The answer to that isn't in those tests since they test just a very specific scenario that is unlikely to happen in real code. The jvm will cache field access, inline getters, rewrite loops to run in inverse order if it sees a benefit or even skip code that it knows isn't going to make an observable difference. And it's capable of revising the optimization when it finds a better one or sees a problem with it. – zapl Nov 21 '15 at 11:11
  • @tibtof: have you turned off the JIT compiler? When I turn it off, I get nearly the same result. 10 billion iterations: wrapper (300 seconds), primitive (132 seconds) – user3490546 Nov 21 '15 at 11:18

1 Answer


It's easy to get fooled by hand-made microbenchmarks: you never know what they actually measure. That's why there are special tools like JMH. But let's analyze what happens to the hand-made benchmark:

static class HDouble {
    double value;
}

public static void main(String[] args) {
    primitive();
    wrapper();
}

public static void primitive() {
    long start = System.nanoTime();
    for (double d = 0; d < 1000000000; d++) {
    }
    long end = System.nanoTime();
    System.out.printf("Primitive: %.3f s\n", (end - start) / 1e9);
}

public static void wrapper() {
    HDouble d = new HDouble();
    long start = System.nanoTime();
    for (d.value = 0; d.value < 1000000000; d.value++) {
    }
    long end = System.nanoTime();
    System.out.printf("Wrapper:   %.3f s\n", (end - start) / 1e9);
}

The results are somewhat similar to yours:

Primitive: 3.618 s
Wrapper:   1.380 s

Now repeat the test several times:

public static void main(String[] args) {
    for (int i = 0; i < 5; i++) {
        primitive();
        wrapper();
    }
}

It gets more interesting:

Primitive: 3.661 s
Wrapper:   1.382 s
Primitive: 3.461 s
Wrapper:   1.380 s
Primitive: 1.376 s <-- starting from 3rd iteration
Wrapper:   1.381 s <-- the timings become equal
Primitive: 1.371 s
Wrapper:   1.372 s
Primitive: 1.379 s
Wrapper:   1.378 s

Looks like both methods got finally optimized. Run it once again, now with logging JIT compiler activity: -XX:-TieredCompilation -XX:CompileOnly=Test -XX:+PrintCompilation

    136    1 %           Test::primitive @ 6 (53 bytes)
   3725    1 %           Test::primitive @ -2 (53 bytes)   made not entrant
Primitive: 3.589 s
   3748    2 %           Test::wrapper @ 17 (73 bytes)
   5122    2 %           Test::wrapper @ -2 (73 bytes)   made not entrant
Wrapper:   1.374 s
   5122    3             Test::primitive (53 bytes)
   5124    4 %           Test::primitive @ 6 (53 bytes)
Primitive: 3.421 s
   8544    5             Test::wrapper (73 bytes)
   8547    6 %           Test::wrapper @ 17 (73 bytes)
Wrapper:   1.378 s
Primitive: 1.372 s
Wrapper:   1.375 s
Primitive: 1.378 s
Wrapper:   1.373 s
Primitive: 1.375 s
Wrapper:   1.378 s

Note the % sign in the compilation log during the first iteration. It means that the methods were compiled in OSR (on-stack replacement) mode. During the second iteration the methods were recompiled in normal mode. From the third iteration on, there was no difference between primitive and wrapper in execution speed.

What you've actually measured is the performance of the OSR stub. It is usually not related to the real performance of an application, and you shouldn't care much about it.
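If you do want a quick-and-dirty measurement without JMH, the least you can do is warm the method up first, so that you time the normally compiled code rather than the OSR stub, and take the best of several runs. A minimal sketch (the class name `WarmupBench` and the iteration counts are my own choices, not from the question):

```java
// Quick-and-dirty timing sketch (hypothetical class name).
// Not a substitute for JMH: it only avoids the most obvious pitfall,
// namely measuring interpreted or OSR-compiled code.
public class WarmupBench {
    // The workload under test: a simple summing loop.
    static double sumLoop(double max) {
        double sum = 0;
        for (double i = 0; i < max; i++) {
            sum += i;
        }
        return sum;
    }

    public static void main(String[] args) {
        double max = 1_000_000;
        double sink = 0;
        // Warmup: call the method enough times that the JIT compiles it
        // in normal (non-OSR) mode before we start the clock.
        for (int w = 0; w < 20; w++) {
            sink += sumLoop(max);
        }
        // Measured runs: take the best of several to reduce noise.
        long best = Long.MAX_VALUE;
        for (int r = 0; r < 5; r++) {
            long start = System.nanoTime();
            sink += sumLoop(max);
            best = Math.min(best, System.nanoTime() - start);
        }
        // Print the sink so the JIT cannot eliminate the loops as dead code.
        System.out.println("best run: " + best / 1000 + " us, sink = " + sink);
    }
}
```

Printing (or otherwise consuming) the accumulated result matters: an empty or dead loop is exactly what the optimizer is allowed to throw away.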

But the question still remains: why is the OSR stub for a wrapper compiled better than for a primitive variable? To find out, we need to look at the generated assembly code:
-XX:CompileOnly=Test -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly

I'll omit all irrelevant code, leaving only the compiled loop.

Primitive:

0x00000000023e90d0: vmovsd 0x28(%rsp),%xmm1      <-- load double from the stack
0x00000000023e90d6: vaddsd -0x7e(%rip),%xmm1,%xmm1
0x00000000023e90de: test   %eax,-0x21f90e4(%rip)
0x00000000023e90e4: vmovsd %xmm1,0x28(%rsp)      <-- store to the stack
0x00000000023e90ea: vucomisd 0x28(%rsp),%xmm0    <-- compare with the stack value
0x00000000023e90f0: ja     0x00000000023e90d0

Wrapper:

0x00000000023ebe90: vaddsd -0x78(%rip),%xmm0,%xmm0
0x00000000023ebe98: vmovsd %xmm0,0x10(%rbx)      <-- store to the object field
0x00000000023ebe9d: test   %eax,-0x21fbea3(%rip)
0x00000000023ebea3: vucomisd %xmm0,%xmm1         <-- compare registers
0x00000000023ebea7: ja     0x00000000023ebe90

As you can see, the 'primitive' case performs a number of loads and stores to a stack location, while the 'wrapper' case works mostly with registers. It is quite understandable why the OSR stub refers to the stack: in interpreted mode, local variables are stored on the stack, and the OSR stub is made compatible with this interpreted frame. In the 'wrapper' case the value is stored on the heap, and the reference to the object is already cached in a register.

apangin