
My understanding was that enhanced for loops should be slower because they must use an Iterator. However, my code is producing mixed results. (Yes, I know that the loop logic takes up the majority of the time spent in a loop.)

For a low number of iterations (100-1000), the enhanced for loop seems to be much faster, both with and without JIT. Conversely, with a high number of iterations (100000000), the traditional loop is much faster. What's going on here?
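
(For context, the enhanced for loop over a List compiles down to roughly this iterator-based form, which is why I expected it to be slower:)

// rough desugaring of "for (int i : array) { l += i; }"
for (Iterator<Integer> it = array.iterator(); it.hasNext(); ) {
    int i = it.next();
    l += i;
}

Here is my benchmark: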

import java.util.ArrayList;
import java.util.List;

public class NewMain {

    public static void main(String[] args) {

        System.out.println("Warming up");

        int warmup = 1000000;
        for (int i = 0; i < warmup; i++) {
            runForLoop();
        }
        for (int i = 0; i < warmup; i++) {
            runEnhancedFor();
        }

        System.out.println("Running");
        int iterations = 100000000;
        long start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            runForLoop();
        }
        System.out.println((System.nanoTime() - start) / iterations + "nS");

        start = System.nanoTime();
        for (int i = 0; i < iterations; i++) {
            runEnhancedFor();
        }
        System.out.println((System.nanoTime() - start) / iterations + "nS");
    }

    public static final List<Integer> array = new ArrayList<Integer>(100);

    public static int l; // accumulator for the loop results

    public static void runForLoop() {
        for (int i = 0; i < array.size(); i++) {
            l += array.get(i);
        }
    }

    public static void runEnhancedFor() {
        for (int i : array) {
            l += i;
        }
    }
}
    Don't use `System.nanoTime()`, it's not reliable. – Elliott Frisch Jan 03 '14 at 04:14
    It's very reliable for this purpose. What do you suggest? – Colby Jan 03 '14 at 04:16
  • possible duplicate of [Enhanced for loop performance worse than traditional indexed lookup?](http://stackoverflow.com/questions/6839494/enhanced-for-loop-performance-worse-than-traditional-indexed-lookup) – Seitaridis Jan 03 '14 at 04:25
  • have a look at this http://stackoverflow.com/questions/11555418/why-is-enhanced-for-loop-efficient-than-normal-for-loop – sasankad Jan 03 '14 at 04:28
  • Right, guys, if you actually read my question: my results are the exact OPPOSITE of said question. I have already read it. – Colby Jan 03 '14 at 04:36
  • It appears that unless you are performing millions of loops, the enhanced loop is FASTER. – Colby Jan 03 '14 at 04:36
  • @Colby You're generating random numbers. Read [this](http://www.ibm.com/developerworks/java/library/j-benchmark2/index.html) article. – Elliott Frisch Jan 03 '14 at 05:01

2 Answers


Faulty benchmarking. A non-exhaustive list of what is wrong:

  • No proper warmup: single-shot measurements are almost always wrong;
  • Mixing several code paths in a single method: we probably start compiling the method with execution data available only for the first loop in the method;
  • Sources are predictable: should the loop compile, we can actually predict the result;
  • Results are dead-code eliminated: should the loop compile, we can throw the loop away (see the sketch after this list).
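
To illustrate that last point, a minimal sketch (hypothetical helper methods): in the first, the result is dead, so the JIT may eliminate the entire loop; in the second, returning the sum keeps the loop alive.

static void deadSum(int[] data) {
    int sum = 0;
    for (int i = 0; i < data.length; i++) {
        sum += data[i]; // 'sum' is never read after the loop...
    }
    // ...so a compiled version may drop the loop entirely
}

static int liveSum(int[] data) {
    int sum = 0;
    for (int i = 0; i < data.length; i++) {
        sum += data[i];
    }
    return sum; // consuming the result defeats dead-code elimination
}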

Take your time listening to these talks and going through these samples.

This is how you do it, arguably correctly, with JMH:

import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;
import org.openjdk.jmh.logic.BlackHole; // JMH pre-1.0; later versions use org.openjdk.jmh.infra.Blackhole

@OutputTimeUnit(TimeUnit.NANOSECONDS)
@BenchmarkMode(Mode.AverageTime)
@Warmup(iterations = 3, time = 1)
@Measurement(iterations = 3, time = 1)
@Fork(3)
@State(Scope.Thread)
public class EnhancedFor {

    private static final int SIZE = 100;

    private List<Integer> list;

    @Setup
    public void setup() {
        list = new ArrayList<Integer>(SIZE);
    }

    @GenerateMicroBenchmark
    public int enhanced() {
        int s = 0;
        for (int i : list) {
            s += i;
        }
        return s;
    }

    @GenerateMicroBenchmark
    public int indexed() {
        int s = 0;
        for (int i = 0; i < list.size(); i++) {
            s += list.get(i);
        }
        return s;
    }

    @GenerateMicroBenchmark
    public void enhanced_indi(BlackHole bh) {
        for (int i : list) {
            bh.consume(i);
        }
    }

    @GenerateMicroBenchmark
    public void indexed_indi(BlackHole bh) {
        for (int i = 0; i < list.size(); i++) {
            bh.consume(list.get(i));
        }
    }

}

...which yields something along the lines of:

Benchmark                         Mode   Samples      Mean   Mean error    Units
o.s.EnhancedFor.enhanced          avgt         9     8.162        0.057    ns/op
o.s.EnhancedFor.enhanced_indi     avgt         9     7.600        0.067    ns/op
o.s.EnhancedFor.indexed           avgt         9     2.226        0.091    ns/op
o.s.EnhancedFor.indexed_indi      avgt         9     2.116        0.064    ns/op

Now, that's a tiny difference between the enhanced and indexed loops, and it is tempting to explain it naively by the different code paths used to access the backing storage. However, the actual explanation is much simpler: the OP FORGOT TO POPULATE THE LIST, which means the loop bodies ARE NEVER EVER EXECUTED, and the benchmark is actually measuring the cost of size() vs. iterator()!
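
In other words, on an empty list the two benchmark methods reduce to roughly the following (a sketch, not actual generated code):

// indexed: a single size() call; the condition fails immediately
if (0 < list.size()) { /* never entered */ }

// enhanced: allocate an Iterator, a single hasNext() call
Iterator<Integer> it = list.iterator();
if (it.hasNext()) { /* never entered */ }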

Fixing that:

@Setup
public void setup() {
    list = new ArrayList<Integer>(SIZE);
    for (int c = 0; c < SIZE; c++) {
        list.add(c);
    }
}

...then yields:

Benchmark                         Mode   Samples       Mean   Mean error    Units
o.s.EnhancedFor.enhanced          avgt         9    171.154       25.892    ns/op
o.s.EnhancedFor.enhanced_indi     avgt         9    384.192        6.856    ns/op
o.s.EnhancedFor.indexed           avgt         9    148.679        1.357    ns/op
o.s.EnhancedFor.indexed_indi      avgt         9    465.684        0.860    ns/op

Note that the differences are minute even at the nano-scale, and any non-trivial loop body will swamp the difference, if there is one. The differences here can be explained by how lucky we are in inlining get() and the Iterator methods, and by the optimizations we can enjoy after those inlinings.

Note the *_indi tests, which defeat the loop-unrolling optimizations. Indexed access enjoys better performance when successfully unrolled, but the result is the opposite when the unrolling is broken!

With results like these, the difference between indexed and enhanced loops is of nothing more than academic interest. Figuring out the exact generated code with -XX:+PrintAssembly for all the cases is left as an exercise to the reader :)
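
For reference, a hypothetical invocation (the benchmark jar name depends on your JMH build; PrintAssembly is a diagnostic flag and requires the hsdis disassembler library installed alongside the JVM):

java -XX:+UnlockDiagnosticVMOptions -XX:+PrintAssembly -jar microbenchmarks.jar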


There are two very different issues in the question. One is a valid observation that, in one specific program, with low iteration count the enhanced for loop time was faster. The other is a generalization of that observation to "For a low number of iterations (100-1000), the enhanced for loop seems to be much faster, both with and without JIT."

I see no justification for that generalization. I made a small change to the program, running the enhanced for-loop test first, then the basic for-loop. I also labeled the outputs to reduce confusion in dealing with modified versions.
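
The labeling change is just a matter of tagging each print (a sketch against the question's code; only the order and the labels differ):

// enhanced loop measured first, output labeled
long start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    runEnhancedFor();
}
System.out.println("Enhanced For-Loop " + (System.nanoTime() - start) / iterations + "nS");

// basic loop measured second, output labeled
start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    runForLoop();
}
System.out.println("Basic For-Loop " + (System.nanoTime() - start) / iterations + "nS");

Here is my output for 100 iterations: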

Warming up
Running
Enhanced For-Loop 2002nS
Basic For-Loop 70nS

With the loops in the original order, I get:

Warming up
Running
Basic For-Loop 2139nS
Enhanced For-Loop 137nS

If I warm up the second loop immediately before running it, instead of doing both warm-ups at the start, I get:

Warming up
Running
Basic For-Loop 1093nS
Enhanced For-Loop 984nS
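
That variant just moves the second warm-up loop down next to its measurement (again a sketch against the question's code):

// basic loop: warmed up at the start, as before
for (int i = 0; i < warmup; i++) {
    runForLoop();
}
long start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    runForLoop();
}
System.out.println("Basic For-Loop " + (System.nanoTime() - start) / iterations + "nS");

// enhanced loop: warm-up moved to just before its measurement
for (int i = 0; i < warmup; i++) {
    runEnhancedFor();
}
start = System.nanoTime();
for (int i = 0; i < iterations; i++) {
    runEnhancedFor();
}
System.out.println("Enhanced For-Loop " + (System.nanoTime() - start) / iterations + "nS");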

The results, for low iteration counts, are very dependent on fine details of the program. That is an inherent danger of micro-benchmarking, and a reason to avoid generalizing from a single observed program to a general assumption about how the measured code would perform in any other program.

  • Good advice. Remember that not only is there the JITted versus un-JITted performance difference (which is one good reason to allow plenty of warm-up time), but that the actual optimization may depend on the context in which the code is being used (which makes isolated "micro kernel" tests of this sort potentially misleading even after having been JITted). – keshlam Jan 03 '14 at 05:31
    Also remember that micro-optimizations are often a waste of programmer time. Infinite optimization of something that only accounts for 1% of the program's total runtime could consume infinite programmer hours yet yield at most a 1% total performance improvement. Instead, concentrate on efficient algorithms and avoid writing obviously bad code, then when you've got it running use performance analyzers to find out where the program is *really* spending its time and start optimizing those sections. – keshlam Jan 03 '14 at 05:33