Why are two separate loops faster than one?

Question

I want to understand what kind of optimizations Java does to consecutive for loops. More precisely, I'm trying to check if loop fusion is performed. Theoretically, I was expecting that this optimization was not done automatically and was expecting to confirm that the fused version was faster than the version with two loops.

However, after running the benchmarks, the results show that two separate (and consecutive) loops are faster than one single loop doing all the work.

I already tried using JMH for creating the benchmarks and got the same results.

I used the javap command and it shows that the generated bytecode for the source file with two loops actually corresponds to two loops being executed (no loop unrolling or other optimization was performed).

Code being measured for BenchmarkMultipleLoops.java:

private void work() {
        List<Capsule> intermediate = new ArrayList<>();
        List<String> res = new ArrayList<>();
        int totalLength = 0;

        for (Capsule c : caps) {
            if(c.getNumber() > 100000000){
                intermediate.add(c);
            }
        }

        for (Capsule c : intermediate) {
            String s = "new_word" + c.getNumber();
            res.add(s);
        }

        //Loop to assure the end result (res) is used for something
        for(String s : res){
            totalLength += s.length();
        }

        System.out.println(totalLength);
    }

Code being measured for BenchmarkSingleLoop.java:

private void work(){
        List<String> res = new ArrayList<>();
        int totalLength = 0;

        for (Capsule c : caps) {
            if(c.getNumber() > 100000000){
                String s = "new_word" + c.getNumber();
                res.add(s);
            }
        }

        //Loop to assure the end result (res) is used for something
        for(String s : res){
            totalLength += s.length();
        }

        System.out.println(totalLength);
    }

And here is the code for Capsule.java:

public class Capsule {
    private int number;
    private String word;

    public Capsule(int number, String word) {
        this.number = number;
        this.word = word;
    }

    public int getNumber() {
        return number;
    }

    @Override
    public String toString() {
        return "{" + number +
                ", " + word + '}';
    }
}

caps is an ArrayList<Capsule> with 20 million elements populated like this in the beginning:

private void populate() {
        Random r = new Random(3);

        for(int n = 0; n < POPSIZE; n++){
            int randomN = r.nextInt();
            Capsule c = new Capsule(randomN, "word" + randomN);
            caps.add(c);
        }
    }

Before measuring, a warm up phase is executed.

I ran each of the benchmarks 10 times or, in other words, the work() method is executed 10 times for each benchmark and the average times to complete are presented below (in seconds). After each iteration, the GC was executed along with a few sleeps:

MultipleLoops: 4.9661 seconds
SingleLoop: 7.2725 seconds

OpenJDK 1.8.0_144 running on an Intel i7-7500U (Kaby Lake).

Why is the MultipleLoops version faster than the SingleLoop version, even though it has to traverse two different data structures?

UPDATE 1:

As suggested in the comments, if I change the implementation to calculate the totalLength while strings are being generated, avoiding the creation of the res list, the single loop version becomes faster.

However, that variable was only introduced so that some work was done after creating the result list, in order to avoid discarding the elements if nothing was done with them.

In other words, the intended result is to produce the final list. But this suggestion helps in better understanding what is going on.

Results:

MultipleLoops: 0.9339 seconds
SingleLoop: 0.66590005 seconds

UPDATE 2:

Here is a link for the code I used for the JMH benchmark: https://gist.github.com/FranciscoRibeiro/2d3928761f76e4f7cecfcfcdf7fc96d5

Results:

MultipleLoops: 7.397 seconds
SingleLoop: 8.092 seconds

What happens if you increment `totalSize` directly, skipping creating the `result` list? — Andy Turner, Feb 23 '18 at 16:45
I guess it's because branch prediction. https://stackoverflow.com/questions/11227809/why-is-it-faster-to-process-a-sorted-array-than-an-unsorted-array/11227902#11227902 — Johannes Kuhn, Feb 23 '18 at 16:46
Keep in mind that your JVM will JIT compile the bytecode, so the instructions that your CPU actually runs don't necessarily reflect unrolling or lack thereof in the bytecode. What JVM did you use, and what hardware? (e.g. i7-6700k (Skylake), or Ryzen, or Silvermont, or what? A CPU model number would be fine, if you don't know the name of the microarchitecture.) — Peter Cordes, Feb 23 '18 at 16:58
@AndyTurner updated the question with info based on your comment — Francisco Ribeiro, Feb 23 '18 at 17:59
@JohannesKuhn that is an interesting observation. So, the overhead of mispredicting in the SingleLoop version is greater than in the MultipleLoops version? Because in SingleLoop, the creation of the Strings would be "reverted", whereas in MultipleLoops, the only thing that might go to waste is the addition of the Capsule objects to the intermediate list. Does this make sense? — Francisco Ribeiro, Feb 23 '18 at 18:04
@PeterCordes yes, JIT might be doing some of its tricks here... is there any way I can check which optimizations it is applying to the bytecode? The JVM I'm using is OpenJDK 1.8.0_144 and the CPU is an Intel i7-7500U (Kaby Lake). — Francisco Ribeiro, Feb 23 '18 at 18:09
Yeah, you can look at the asm produced by the JIT compiler. https://stackoverflow.com/questions/1503479/how-to-see-jit-compiled-code-in-jvm, and / or https://stackoverflow.com/a/48599683/224132. — Peter Cordes, Feb 23 '18 at 18:13
Just tried to reproduce your results without success. For me, both loops have approx. the same performance, see https://gist.github.com/l-wi/20a923af1ee707e885e087e136af0bfe. Could there be a flaw in your (or my) benchmark code? — lwi, Feb 23 '18 at 18:24
Usually for that sort of procedures JVM needs to be configured specifically for CPU, maybe it worth to look at garbage collector configurations and optimise it — Evilguy, Feb 23 '18 at 18:57
@JohannesKuhn: Branch prediction seems unlikely; both versions have the same input. Possibly the mispredict penalty is higher because of the details of the surrounding code in the combined loop, but I'd guess that the mispredict *rate* is the same. — Peter Cordes, Feb 23 '18 at 19:37
Please provide an MCVE. Emphasis on C ... for Complete. You said: *"I already tried using JMH for creating the benchmarks and got the same results."* Then you should replace the code in the question with the JMH-based version and the corresponding results. Right now, there are people saying that they can't reproduce your results, but we / they don' know if they are running the same benchmark as you. — Stephen C, Feb 23 '18 at 23:49
@lwi The collection you used has a smaller size than the one I'm using (2M vs 20M). But I think the results you got already show a difference in performance between the two alternatives. Did you try using the 20M collection size? Also, I don't rebuild the list every time I execute a benchmark method because the list does not get modified. Can this be a flaw? I also executed the benchmark you posted and got 0.107 s/op (one loop) vs 0.071 s/op (two loops). Error of +- 0.001 for both. — Francisco Ribeiro, Feb 26 '18 at 09:46
@StephenC I updated the question with a link to the JMH benchmark code I used — Francisco Ribeiro, Feb 26 '18 at 09:57
See also [How do I write a correct micro-benchmark in Java?](https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java). — lexicore, Mar 01 '18 at 21:13
Try doing the SingleLoops test first. It could be that the test you run first is creating GC work during the test you run second. — Matt Timmermans, Mar 29 '18 at 03:39
@MattTimmermans I tried it at the time and got the same kind of results. However, GC does seem to be causing this results — Francisco Ribeiro, Apr 06 '18 at 16:08

score 2 · Answer 1 · answered Apr 05 '18 at 15:14

I investigated this "phenomena" and look like got something like an answer.
Let's add .jvmArgs("-verbose:gc") to JMHs OptionsBuilder. Results for 1 Iteration:

Single Loop: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4574771K(5592576K)] 5997483K->4574771K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 5.0438301 secs] [Times: user=37.92 sys=0.10, real=5.05 secs] 4.954 s/op

Multiple Loops: [Full GC (Ergonomics) [PSYoungGen: 2097664K->0K(2446848K)] [ParOldGen: 3899819K->4490913K(5592576K)] 5997483K->4490913K(8039424K), [Metaspace: 6208K->6208K(1056768K)], 3.7991573 secs] [Times: user=26.84 sys=0.08, real=3.80 secs] 4.187 s/op

JVM spent huge amount of CPU time for GC. Once per 2 test runs, JVM have to make Full GC (move 600Mb to OldGen and collect 1.5Gb of trash from previous cycles). Both garbage collectors done the same work, but spent ~25% less application time for multiple loops test case. If we decrease the POPSIZE to 10_000_000 or add to before bh.consume() Thread.sleep(3000), or add -XX:+UseG1GC to JVM args, then multiple loop boosting effect gone. I run it once again with .addProfiler(GCProfiler.class). The main difference:

Multiple Loops: gc.churn.PS_Eden_Space 374.417 ± 23 MB/sec

Single Loop: gc.churn.PS_Eden_Space 336.037 MB/sec ± 19 MB/sec

I think, we see speed up in such specific circumstances, because old-good Compare and Swap GC algorithm has CPU bottleneck for multiples test runs and uses extra "senseless" cycle for Collect Garbage from earlier runs. It's even easier to reproduce with @Threads(2), if you have enough RAM. It looks like this, if you try to profile Single_Loop test:

profiling

Some time after posting the question, I made some experiments with different GC algorithms: Serial GC, Parallel GC and G1GC. Parallel GC seemed to affect the performance of SingleLoop the most. Using G1GC or Serial GC shortened the difference between Multiple and Single, but the best I could achieve was having SingleLoop perform almost identically to MultipleLoops. Your experiments seem to point in the same direction, that is, GC is the "culprit". Thanks! — Francisco Ribeiro, Apr 06 '18 at 16:17
May be, I wasn't super clear. We have perfomance boost mainly because of Compare and Swap GC (which parallel as well) cheats JMH using multiple CPU cores, even in case you set @Thread=1 explicitly. Thank you for interesting quastion and good luck ;) — Anton Kot, Apr 06 '18 at 16:54

score 1 · Answer 2 · answered Mar 29 '18 at 03:19

To understand what is happening under the hood, you could add JMX behaviour to analyse running app in jvisualvm, located in JAVA_HOME\bin With 20M size of a capsule list in the memory, it ran out of memory and visualvm went in non-responding state. I had reduced capsule list size to 200k and 100M to 1M in if condition to test. After observing behaviour on visualvm, single loop execution completed prior to multiple loops. Maybe this is not the right approach, but you could experiment with it.

LoopBean.java

import java.util.List;
public interface LoopMBean {
    void multipleLoops();
    void singleLoop();
    void printResourcesStats();
}

Loop.java

import java.util.ArrayList;
import java.util.List;
import java.util.Random;

public class Loop implements LoopMBean {

    private final List<Capsule> capsules = new ArrayList<>();

    {
        Random r = new Random(3);
        for (int n = 0; n < 20000000; n++) {
            int randomN = r.nextInt();
            capsules.add(new Capsule(randomN, "word" + randomN));
        }
    }

    @Override
    public void multipleLoops() {

        System.out.println("----------------------Before multiple loops execution---------------------------");
        printResourcesStats();

        final List<Capsule> intermediate = new ArrayList<>();
        final List<String> res = new ArrayList<>();
        int totalLength = 0;

        final long start = System.currentTimeMillis();

        for (Capsule c : capsules)
            if (c.getNumber() > 100000000) {
                intermediate.add(c);
            }

        for (Capsule c : intermediate) {
            String s = "new_word" + c.getNumber();
            res.add(s);
        }

        for (String s : res)
            totalLength += s.length();

        System.out.println("multiple loops=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");

        System.out.println("----------------------After multiple loops execution---------------------------");
        printResourcesStats();

        res.clear();
    }

    @Override
    public void singleLoop() {

        System.out.println("----------------------Before single loop execution---------------------------");
        printResourcesStats();

        final List<String> res = new ArrayList<>();
        int totalLength = 0;

        final long start = System.currentTimeMillis();

        for (Capsule c : capsules)
            if (c.getNumber() > 100000000) {
                String s = "new_word" + c.getNumber();
                res.add(s);
            }

        for (String s : res)
            totalLength += s.length();

        System.out.println("Single loop=" + totalLength + " | time taken=" + (System.currentTimeMillis() - start) + " milliseconds");
        System.out.println("----------------------After single loop execution---------------------------");
        printResourcesStats();

        res.clear();
    }

    @Override
    public void printResourcesStats() {
        System.out.println("Max Memory= " + Runtime.getRuntime().maxMemory());
        System.out.println("Available Processors= " + Runtime.getRuntime().availableProcessors());
        System.out.println("Total Memory= " + Runtime.getRuntime().totalMemory());
        System.out.println("Free Memory= " + Runtime.getRuntime().freeMemory());
    }
}

LoopClient.java

import javax.management.MBeanServer;
import javax.management.ObjectName;
import java.lang.management.ManagementFactory;

public class LoopClient {

    void init() {

        final MBeanServer mBeanServer = ManagementFactory.getPlatformMBeanServer();
        try {
            mBeanServer.registerMBean(new Loop(), new ObjectName("LOOP:name=LoopBean"));
        } catch (Exception e) {
            e.printStackTrace();
        }

    }

    public static void main(String[] args) {

        final LoopClient client = new LoopClient();
        client.init();
        System.out.println("Loop client is running...");
        waitForEnterPressed();
    }

    private static void waitForEnterPressed() {
        try {
            System.out.println("Press  to continue...");
            System.in.read();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }
}

Execute with following command:

java -Dcom.sun.management.jmxremote -Dcom.sun.management.jmxremote.port=9999 -Dcom.sun.management.jmxremote.authenticate=false -Dcom.sun.management.jmxremote.ssl=false LoopClient

You could add -Xmx3072M extra option for quick memory increase to avoid OutOfMemoryError

Why are two separate loops faster than one?

UPDATE 1:

UPDATE 2:

2 Answers2