
I have a loop with two counters, i and j. If they hold the same value, an iteration runs much faster than if their values differ:

Benchmark                     Mode  Cnt       Score      Error  Units
FloatsArrayBenchmark.times   thrpt   20  341805.800 ± 1623.320  ops/s
FloatsArrayBenchmark.times2  thrpt   20  198764.909 ± 1608.387  ops/s

The Java bytecode is identical for both methods, so the difference must come from some lower-level optimization. Can someone explain why this is happening? Here's the benchmark:

import org.openjdk.jmh.annotations.*;

public class FloatsArrayBenchmark {
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(new String[]{FloatsArrayBenchmark.class.getSimpleName()});
    }

    @Benchmark @Fork(value = 1, warmups = 0)
    public void times(Data data) {
        float[] result = new float[10000];
        for (int i = 0, j = 0; i < 9_999; i++, j++)
            result[j] = data.floats[i] * 10;
    }
    @Benchmark @Fork(value = 1, warmups = 0)
    public void times2(Data data) {
        float[] result = new float[10000];
        for (int i = 0, j = 1; i < 9_999; i++, j++)
            result[j] = data.floats[i] * 10;
    }

    @State(Scope.Benchmark)
    public static class Data {
        private final float[] floats = new float[10000];
    }
}
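
Since the bytecode is the same, the difference must show up in the JIT-compiled code. One way to look at that is JMH's assembly profiler; the following is only a sketch (the FloatsArrayAsmRunner class name is made up here), assuming the standard JMH Runner/OptionsBuilder API, the hsdis disassembly library being installed, and the dtraceasm profiler on macOS (perfasm on Linux):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FloatsArrayAsmRunner {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(FloatsArrayBenchmark.class.getSimpleName())
                // print the hottest JIT-compiled regions; needs hsdis and, for dtrace, root
                .addProfiler("dtraceasm")   // "perfasm" on Linux
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}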

Environment:

  • macOS, tried Java 8, Java 11, Java 14
  • 2.4 GHz Quad-Core Intel Core i5
Stanislav Bashkyrtsev

1 Answer


In the first (faster) version, i effectively always has the same value as j, so this:

public void times(Data data) {
    float[] result = new float[10000];
    for (int i = 0, j = 0; i < 9_999; i++, j++)
        result[j] = data.floats[i] * 10;
}

can be re-written without j with identical effect:

public void times(Data data) {
    float[] result = new float[10000];
    for (int i = 0; i < 9_999; i++)
        result[i] = data.floats[i] * 10;
}

It is likely that the compiler recognised that j is redundant and eliminated it, halving the number of ++ operations performed; the eliminated increment accounts for a third of the arithmetic in each iteration. This is roughly consistent with the timings: the second version takes about 70% longer per iteration (341805.8 / 198764.9 ≈ 1.72), which is in the same ballpark as the 50% expected from a 3:2 ratio of operations.
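
For comparison, the slower loop could also be written with a single counter by folding j into the index expression; this is just a sketch (the times2SingleCounter name is made up, it is not in the original benchmark), showing the shape such a rewrite could take. Whether HotSpot actually performs this transformation would have to be confirmed from the generated assembly:

@Benchmark @Fork(value = 1, warmups = 0)
public void times2SingleCounter(Data data) {
    float[] result = new float[10000];
    // j is always i + 1, so it can be folded into the array index,
    // leaving a single increment per iteration, just like times()
    for (int i = 0; i < 9_999; i++)
        result[i + 1] = data.floats[i] * 10;
}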

Bohemian
  • A slightly smarter compiler could still CSE them and use a constant 4-byte offset as part of the addressing mode for `result[i+1]`, or do the load before an increment and the store after an increment. e.g. clang has no problem with this when compiling C ahead-of-time (https://godbolt.org/z/aeKMrG). But it wouldn't surprise me if HotSpot's JIT missed this simple optimization. – Peter Cordes Sep 08 '20 at 05:32
  • To a first approximation, one extra instruction in a small loop without unrolling can increase run-time by that much, on modern CPU front-ends like Haswell and later that don't have such a strong multiple-of-front-end-width effect ([Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872) - yes on Sandybridge, not as much on HSW). With unrolling it makes no sense, unless the compiler was really bad at unrolling and did each integer increment instead of using addressing-mode offsets. – Peter Cordes Sep 08 '20 at 05:34
  • Could be, though I'm not sure the calculations add up: there's also the loop-end condition, and Java checks whether we went beyond the array size, so I bet there are plenty of additional operations the CPU has to perform. – Stanislav Bashkyrtsev Sep 08 '20 at 08:02