
I have a loop with two counters, i and j. If they hold the same value, an iteration runs much faster than if their values differ:

Benchmark                     Mode  Cnt       Score      Error  Units
FloatsArrayBenchmark.times   thrpt   20  341805.800 ± 1623.320  ops/s
FloatsArrayBenchmark.times2  thrpt   20  198764.909 ± 1608.387  ops/s

The Java bytecode is identical for both methods, so the difference must come from some lower-level optimization. Can someone explain why this is happening? Here's the benchmark:

import org.openjdk.jmh.annotations.*;

public class FloatsArrayBenchmark {
    public static void main(String[] args) throws Exception {
        org.openjdk.jmh.Main.main(new String[]{FloatsArrayBenchmark.class.getSimpleName()});
    }

    @Benchmark @Fork(value = 1, warmups = 0)
    public void times(Data data) {
        float[] result = new float[10000];
        for (int i = 0, j = 0; i < 9_999; i++, j++)
            result[j] = data.floats[i] * 10;
    }
    @Benchmark @Fork(value = 1, warmups = 0)
    public void times2(Data data) {
        float[] result = new float[10000];
        for (int i = 0, j = 1; i < 9_999; i++, j++)
            result[j] = data.floats[i] * 10;
    }

    @State(Scope.Benchmark)
    public static class Data {
        private final float[] floats = new float[10000];
    }
}
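
Since the bytecode is the same, the difference must show up in the JIT-compiled code. One way to look at that is JMH's assembly profiler; the following is only a sketch (the FloatsArrayAsmRunner class name is made up here), assuming the standard JMH Runner/OptionsBuilder API, the hsdis disassembly library being installed, and the dtraceasm profiler on macOS (perfasm on Linux):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class FloatsArrayAsmRunner {
    public static void main(String[] args) throws Exception {
        Options opt = new OptionsBuilder()
                .include(FloatsArrayBenchmark.class.getSimpleName())
                // print the hottest JIT-compiled regions; needs hsdis and, for dtrace, root
                .addProfiler("dtraceasm")   // "perfasm" on Linux
                .forks(1)
                .build();
        new Runner(opt).run();
    }
}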

Environment:

  • macOS, tried Java 8, Java 11, Java 14
  • 2.4 GHz Quad-Core Intel Core i5
Stanislav Bashkyrtsev

1 Answer


In the first (faster) version, i effectively always has the same value as j, so this:

public void times(Data data) {
    float[] result = new float[10000];
    for (int i = 0, j = 0; i < 9_999; i++, j++)
        result[j] = data.floats[i] * 10;
}

can be re-written without j with identical effect:

public void times(Data data) {
    float[] result = new float[10000];
    for (int i = 0; i < 9_999; i++)
        result[i] = data.floats[i] * 10;
}

It is likely that the compiler recognised that j is redundant and eliminated it, halving the number of ++ operations performed; the eliminated increment accounts for a third of the arithmetic in each iteration. This is roughly consistent with the timings: the second version takes about 70% longer per iteration (341805.8 / 198764.9 ≈ 1.72), which is in the same ballpark as the 50% expected from a 3:2 ratio of operations.
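
For comparison, the slower loop could also be written with a single counter by folding j into the index expression; this is just a sketch (the times2SingleCounter name is made up, it is not in the original benchmark), showing the shape such a rewrite could take. Whether HotSpot actually performs this transformation would have to be confirmed from the generated assembly:

@Benchmark @Fork(value = 1, warmups = 0)
public void times2SingleCounter(Data data) {
    float[] result = new float[10000];
    // j is always i + 1, so it can be folded into the array index,
    // leaving a single increment per iteration, just like times()
    for (int i = 0; i < 9_999; i++)
        result[i + 1] = data.floats[i] * 10;
}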

Bohemian
  • A slightly smarter compiler could still CSE them and use a constant 4-byte offset as part of the addressing mode for `result[i+1]`, or do the load before an increment and the store after an increment. e.g. clang has no problem with this when compiling C ahead-of-time (https://godbolt.org/z/aeKMrG). But it wouldn't surprise me if HotSpot's JIT missed this simple optimization. – Peter Cordes Sep 08 '20 at 05:32
  • To a first approximation, one extra instruction in a small loop without unrolling can increase run-time by that much, on modern CPU front-ends like Haswell and later that don't have such a strong multiple-of-front-end-width effect ([Is performance reduced when executing loops whose uop count is not a multiple of processor width?](https://stackoverflow.com/q/39311872) - yes on Sandybridge, not as much on HSW). With unrolling it makes no sense, unless the compiler was really bad at unrolling and did each integer increment instead of using addressing-mode offsets. – Peter Cordes Sep 08 '20 at 05:34
  • Could be, though I'm not sure the calculations add up: there's also the loop-end condition, and Java checks whether we went beyond the array size, so I bet there are plenty of additional operations the CPU has to perform. – Stanislav Bashkyrtsev Sep 08 '20 at 08:02