
I just came across this seemingly innocuous comment in a benchmark of ArrayList vs. a raw String array. It's from a couple of years ago, but the OP writes:

> I did notice that using `for String s: stringsList` was about 50% slower than using an old-style for-loop to access the list. Go figure...

Nobody commented on it in the original post, and the test seemed a little dubious (too short to be accurate), but I nearly fell out of my chair when I read it. I've never benchmarked an enhanced loop against a "traditional" one, but I'm currently working on a project that does hundreds of millions of iterations over ArrayList instances using enhanced loops, so this is a concern to me.

I'm going to do some benchmarking and post my findings here, but I could find precious little information online about the relative performance, except for a couple of offhand mentions that enhanced loops over ArrayLists do run a lot slower under Android.

Has anybody experienced this? Does such a performance gap still exist? I suspect that if it ever did, it has been fixed in more modern VMs, but I guess now I'll have to do some testing to confirm.

Update: I made some changes to my code, but was already suspecting what others here have pointed out: yes, the enhanced for loop is slower, but outside of very trivial tight loops the cost should be a minuscule fraction of the cost of the loop body's logic. In my case, even though I'm iterating over very large lists of strings using enhanced loops, my logic inside the loop is complex enough that I couldn't even measure a difference after switching to index-based loops.

TL;DR: enhanced for loops are indeed slower than a traditional index-based loop over an ArrayList, but for most applications the difference should be negligible.
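
For reference, here is a minimal sketch (class, variable names, and data are just placeholders) of the two loop forms I'm comparing; for a List, the enhanced form is equivalent to the explicit Iterator loop shown last:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.Iterator;
    import java.util.List;

    public class LoopForms {
        public static void main(String[] args) {
            List<String> strings = new ArrayList<String>(Arrays.asList("a", "b", "c"));

            // "traditional" index-based loop: no Iterator object is created
            for (int i = 0, n = strings.size(); i < n; i++) {
                System.out.println(strings.get(i));
            }

            // enhanced for loop over the List...
            for (String s : strings) {
                System.out.println(s);
            }

            // ...which the compiler translates into roughly this Iterator-based loop
            for (Iterator<String> it = strings.iterator(); it.hasNext(); ) {
                String s = it.next();
                System.out.println(s);
            }
        }
    }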

jkraybill
  • My understanding was that the notation creates an `Iterator` object for the collection, and each iteration `Iterator.next()` is called instead of using an index. It would make sense that this is actually slower than accessing the elements manually in the case of an array. –  Jul 27 '11 at 04:51
  • Yeah, now that I've seen it and looked at the source, I get that there is significantly more overhead in the enhanced loop than in a direct array access. I'd be shocked if it's 50% in modern VMs, but I'll soon find out; I'm still really surprised there is seemingly so little documentation about the performance difference, though. – jkraybill Jul 27 '11 at 04:56
  • Take a look at this: http://developer.android.com/guide/practices/design/performance.html#foreach –  Jul 27 '11 at 05:02
  • Yeah, it's actually the same content I referred to in my second link. 3x seems crazy to me, even on mobile, but if true, it seems like it should be a more widely known performance issue. I've been writing enhanced loops since they were introduced and never considered the cost. – jkraybill Jul 27 '11 at 05:28
  • In the majority of cases this is still micro-optimization. The relative difference between these two approaches is rarely relevant. Heck, for-each could be 100% slower than an old-school for loop. The question is, are the actual numbers meaningful? Does it matter if in your business logic some loop takes, say, 200ms, of which the actual looping (for-each vs. for) takes 2ms vs. 1ms? – merryprankster Jul 27 '11 at 07:15
  • Is this improved in Java 8 or going to be improved in the future? – Mladen Adamovic Feb 06 '15 at 07:23

4 Answers


The problem you have is that using an Iterator will be slower than using a direct lookup. On my machine the difference is about 0.13 ns per iteration. Using an array instead of the List saves a further 0.15 ns per iteration. This should be trivial in 99% of situations.

public static void main(String... args) {
    int testLength = 100 * 1000 * 1000;
    String[] stringArray = new String[testLength];
    Arrays.fill(stringArray, "a");
    List<String> stringList = new ArrayList<String>(Arrays.asList(stringArray));
    {
        // enhanced for loop over the raw array
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringArray) {
            total += str.length();
        }
        System.out.printf("The for each Array loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        // index-based loop using List.get()
        long start = System.nanoTime();
        long total = 0;
        for (int i = 0, stringListSize = stringList.size(); i < stringListSize; i++) {
            String str = stringList.get(i);
            total += str.length();
        }
        System.out.printf("The for/get List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
    {
        // enhanced for loop (Iterator) over the List
        long start = System.nanoTime();
        long total = 0;
        for (String str : stringList) {
            total += str.length();
        }
        System.out.printf("The for each List loop time was %.2f ns total=%d%n", (double) (System.nanoTime() - start) / testLength, total);
    }
}

When run with one billion entries, it prints (using Java 6 update 26):

The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000

When run with one billion entries, it prints (using OpenJDK 7):

The for each Array loop time was 0.76 ns total=1000000000
The for/get List loop time was 0.91 ns total=1000000000
The for each List loop time was 1.04 ns total=1000000000

i.e. exactly the same. ;)

Peter Lawrey

Every claim that X is slower than Y on a JVM, if it does not address all the issues presented in this article and its second part, spreads fear and misinformation about the performance of a typical JVM. This applies to the comment referred to by the original question as well as to GravityBringer's answer. I am sorry to be so rude, but unless you use appropriate micro-benchmarking technique, your benchmarks produce badly skewed random numbers.

Tell me if you're interested in more explanation, although it is all in the articles I referred to.
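
For what it's worth, a purpose-built harness like JMH takes care of most of these issues (warm-up, on-stack replacement, dead-code elimination) for you. Here is a rough sketch of how the three loop styles could be measured with it; the class name and list size are placeholders, and this is a sketch rather than a measured result:

    import java.util.ArrayList;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Thread)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.NANOSECONDS)
    public class LoopStyleBenchmark {

        String[] array;
        List<String> list;

        @Setup
        public void setup() {
            array = new String[1000000];
            Arrays.fill(array, "a");
            list = new ArrayList<String>(Arrays.asList(array));
        }

        // enhanced for loop over the List (uses an Iterator under the hood)
        @Benchmark
        public long enhancedForList() {
            long total = 0;
            for (String s : list) {
                total += s.length();
            }
            return total; // returning the result keeps the JIT from eliminating the loop
        }

        // traditional index-based loop with List.get()
        @Benchmark
        public long indexedForList() {
            long total = 0;
            for (int i = 0, n = list.size(); i < n; i++) {
                total += list.get(i).length();
            }
            return total;
        }

        // enhanced for loop over the raw array (no Iterator involved)
        @Benchmark
        public long enhancedForArray() {
            long total = 0;
            for (String s : array) {
                total += s.length();
            }
            return total;
        }
    }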

jmg
  • I'm aware that it is difficult to measure performance super-accurately the way that I've done it. However, I think what I've done is a reasonable quick-and-dirty benchmark that can successfully detect very large performance differences. I did commit a large mistake in the array performance measuring code in my original post that I've now fixed (which made the array tests only comparable to each other and not to the ArrayList tests). Having fixed the unboxing issue, I'm still seeing a large difference, however. – Gravity Jul 27 '11 at 06:21
  • I know about the pitfalls in benchmarking, theoretical tests, etc; that's why I'm going to try making some loop changes in my production code which does real work on massive loops and runs for typically 24-48 hours on a large dataset and see what happens. Even then, obviously EMWV (everyone's mileage will vary :) – jkraybill Jul 27 '11 at 06:21
  • This [SO question on the topic](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) has some pretty good information as well. – Joachim Sauer Jul 27 '11 at 06:22
  • @GravityBringer: The JIT easily changes the execution time of simple loops by orders of magnitude. Therefore I disagree that it is "reasonable" to do such "quick-and-dirty benchmark"s. – jmg Jul 27 '11 at 07:29
  • @jmg I actually read the articles you linked to. Good articles, thanks for the link. However, I would like to note that I was already doing almost everything indicated there. I have now changed my test to deal with the one issue I wasn't dealing with before (On-Stack Replacement), but that didn't change my results significantly. I'm forced to conclude this is probably a configuration difference. I would love to actually get my ArrayLists to be as fast as an array, though. I wish I was getting the same results as everyone else! – Gravity Jul 27 '11 at 09:07

GravityBringer's numbers don't seem right, because I know `ArrayList.get()` is as fast as raw array access after VM optimization.

I ran GravityBringer's test twice on my machine, in `-server` mode:

(1st round)
50574847
43872295
30494292
30787885
(2nd round)
33865894
32939945
33362063
33165376

The bottleneck in such tests is actually memory reads/writes. Judging from the numbers, both arrays fit entirely in my L2 cache. If we decrease the size to fit in the L1 cache, or increase it beyond the L2 cache, we'll see a 10x difference in throughput.

The iterator of ArrayList uses a single int counter. Even if the VM doesn't put it in a register (because the loop body is too complex), it will at least be in the L1 cache, so reading and writing it is basically free.
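
To make that concrete, here is a simplified sketch of what such an index-based iterator looks like (an illustration, not the actual JDK source; the real ArrayList iterator also carries a modification-count check):

    import java.util.Iterator;
    import java.util.NoSuchElementException;

    // Simplified shape of an index-based list iterator: the only mutable
    // state is a single int cursor, which lives in a register or the L1 cache.
    class SimpleListIterator<E> implements Iterator<E> {
        private final E[] elements;
        private final int size;
        private int cursor; // the single int counter mentioned above

        SimpleListIterator(E[] elements, int size) {
            this.elements = elements;
            this.size = size;
        }

        public boolean hasNext() {
            return cursor < size;
        }

        public E next() {
            if (cursor >= size) {
                throw new NoSuchElementException();
            }
            return elements[cursor++];
        }

        public void remove() {
            throw new UnsupportedOperationException();
        }
    }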

The ultimate answer of course is to test your particular program in your particular environment.

Though it's not helpful to play agnostic whenever a benchmark question is raised.

irreputable
  • Hmmm. I guess it depends on the machine and settings. I ran my test in a for loop as well and didn't see much difference between successive rounds. My numbers aren't even close to yours. One person that commented on my post seems to have gotten similar results to yours, though. After VM optimization...apparently my VM doesn't optimize the same way. What version of Java do you have? It makes sense, of course, that a JVM could and should optimize the foreach loop. – Gravity Jul 27 '11 at 07:24
  • It appears that your VM doesn't inline `ArrayList.get()`, which is the most basic optimization. – irreputable Jul 27 '11 at 07:42
  • I'm looking at the source of ArrayList [here](http://www.java2s.com/Open-Source/Java-Document/6.0-JDK-Core/Collections-Jar-Zip-Logging-regex/java/util/ArrayList.java.htm), and it looks like the get() method has more code than just an array access (an additional check). That's why I'm puzzled at how you're getting the same performance. Maybe some JVMs have a special ArrayList-specific optimization? – Gravity Jul 27 '11 at 07:49
  • That's nothing. If the methods are inlined, even if the VM doesn't do further optimization, the CPU will take care of it: branch prediction will reduce the cost of the range checks to nothing. The main cost is loading the elements from memory. – irreputable Jul 27 '11 at 07:56
  • Branch prediction can't make if statements free. At the very least, it's still a compare instruction and a conditional jump. That's cheap, but accessing memory locations that are likely to be next to each other is also rather cheap because of caching. I could see the performance being highly similar, but it's strange that you aren't measuring any difference whatsoever. – Gravity Jul 27 '11 at 08:10

The situation has gotten worse for ArrayLists. On my computer running Java 6 update 26, there is a fourfold difference. Interestingly (and perhaps quite logically), there is no difference for raw arrays. I ran the following test:

    int testSize = 5000000;

    ArrayList<Double> list = new ArrayList<Double>();
    Double[] arr = new Double[testSize];

    //set up the data - make sure data doesn't have patterns
    //or anything compiler could somehow optimize
    for (int i=0;i<testSize; i++)
    {
        double someNumber = Math.random();
        list.add(someNumber);
        arr[i] = someNumber;
    }

    //ArrayList foreach
    long time = System.nanoTime();
    double total1 = 0;
    for (Double k: list)
    {
        total1 += k;
    }
    System.out.println (System.nanoTime()-time);

    //ArrayList get() method
    time = System.nanoTime();
    double total2 = 0;
    for (int i=0;i<testSize;i++)
    {
        total2 += list.get(i);  
    }
    System.out.println (System.nanoTime()-time);        

    //array foreach
    time = System.nanoTime();
    double total3 = 0;
    for (Double k: arr)
    {
        total3 += k;
    }
    System.out.println (System.nanoTime()-time);

    //array indexing
    time = System.nanoTime();
    double total4 = 0;
    for (int i=0;i<testSize;i++)
    {
        total4 += arr[i];
    }
    System.out.println (System.nanoTime()-time);

    //would be strange if different values were produced,
    //but no, all these are the same, of course
    System.out.println (total1);
    System.out.println (total2);        
    System.out.println (total3);
    System.out.println (total4);

The arithmetic in the loops is to prevent the JIT compiler from possibly optimizing away some of the code. The effect of the arithmetic on performance is small, as the runtime is dominated by the ArrayList accesses.

The runtimes are (in nanoseconds):

ArrayList foreach: 248,351,782

ArrayList get(): 60,657,907

array foreach: 27,381,576

array direct indexing: 27,468,091
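
As a side note on the arithmetic mentioned above: if a loop's result is never consumed, the JIT is free to remove the loop entirely and the timing tells you nothing. A trivial sketch of the pitfall and the fix, reusing the same testSize and arr as above:

    // If 'total' were never read after the loop, the JIT could legally
    // eliminate the whole loop, making the measured time meaningless.
    long start = System.nanoTime();
    double total = 0;
    for (int i = 0; i < testSize; i++) {
        total += arr[i];
    }
    long elapsed = System.nanoTime() - start;
    // Printing (or otherwise using) the result keeps the work from being discarded.
    System.out.println(elapsed + " ns, total=" + total);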

Gravity
  • That's a huge difference, although I think a part of the performance difference in your testing is coming from auto-unboxing. I'm going to try a similar test with Strings. – jkraybill Jul 27 '11 at 05:54
  • Sure, but auto-unboxing is present in both the ArrayList test cases, is it not? The return type of `get()` is `Double` and not `double`. Auto-unboxing is likely to account for some of the difference between the array and ArrayList cases. – Gravity Jul 27 '11 at 06:00
  • Yes, that's what I was referring to. I wouldn't think unboxing would contribute to the foreach vs. get() difference, but it's just one of those things that sometimes can unexpectedly muddy the waters for performance comparisons. I'll post my findings using my production code. – jkraybill Jul 27 '11 at 06:08
  • It's not as bad as you make it out to be. You have a critical error in your code. You are comparing an `ArrayList` with a `double[]`. The ArrayList incurs the overhead of autoboxing. I changed the array to `Double[]` and got the following 4 times: 48411663,54258692,32354829,32090971. When I replaced `Double` with my own wrapper class (so no autoboxing) I got similar times: 53714765,51310454,38736870,38451530. In the worst case, that is a difference of 3 ns per iteration, a small price to pay for convenience. – Paul Jul 27 '11 at 06:09
  • See my edited post. You were right about the unboxing. Does my edit address your concerns, though? – Gravity Jul 27 '11 at 06:09
  • Your code has the critical flaw that it uses inadequate measurement. See the references in my answer. – jmg Jul 27 '11 at 06:12
  • @Paul: Unboxing does not account for the difference in speed between the two different ArrayList tests, as they both unbox equally. Investigating the relationship between the two ArrayList uses was the OP's question, and I (sloppily) added the array test as an afterthought. However, I've now corrected my code, and I'm still seeing a wild performance difference. Do you agree my new test code is correct? – Gravity Jul 27 '11 at 06:13
  • @Paul: what numbers are you seeing for the other (ArrayList) stats? Are your finding comparable to mine after my edit? – Gravity Jul 27 '11 at 06:19
  • It looks reasonable but I'd have to run it through a profiler, something I'm not going to do at this late hour. If you really want to turn off optimizations execute the code with this flag: `-Djava.compiler=NONE` That really makes a big difference, especially in the use of the ArrayList. – Paul Jul 27 '11 at 06:20
  • @Gravity, my numbers above are using `Double` - I did not see the large difference you did. I'd be curious to see what numbers you'd get for the 2nd run if you wrapped the test in a loop to execute it twice. That large first value does not make sense and I wonder if there is some initialization or loading going on under the hood for the ArrayList. – Paul Jul 27 '11 at 06:26
  • There is definitely some start up overhead at play. When I loop the test a few times, using a `testSize` of a more reasonable 50, subsequent runs of the ArrayList for:each are 1/10th the first run's time. – Paul Jul 27 '11 at 06:33
  • Just ran it in a loop 10 times, and each time gives near-identical results to the numbers I've reported. My exact code is in my post, if you want to try exactly the same thing. Looking at the source for `ArrayList`, it seems that it involves a few extra method calls and checks, so the difference between Array and ArrayList is reasonable. I don't know about the difference between the two ArrayList tests, though. I think with 5,000,000 the initialization costs are negligible. With 50, it's hard to get timing of any real accuracy, unless you loop many times. – Gravity Jul 27 '11 at 06:39
  • -1: I'm not getting these results (yes, *same code*). After JIT compilation (and possibly other JVM optimizations) the difference between the four is negligible. *Choose the one that is best for you!* – dacwe Jul 27 '11 at 08:35
  • Interesting. The difference is negligible for you. All I can tell you is that I'm getting the numbers I indicated, so it apparently varies from configuration to configuration. Other people have reported results similar to yours, or differences that are measurable but nonetheless not as big as the ones I'm seeing. I wonder what's at work here... – Gravity Jul 27 '11 at 08:58
  • All I meant was that without an explanation of why you are getting the results you are, they are pretty worthless. Other people might see this question (and your answer) and think "well, from now on I will only use arrays, as they seem by far the fastest", and that is just not true... – dacwe Jul 27 '11 at 10:09
  • Even if many more people actually do get my results (I would think that some people somewhere would), it doesn't mean that good coding habits should be broken for performance unless it can be proven it makes a difference. – Gravity Jul 27 '11 at 16:39