
Consider the following code:

import com.google.common.base.Stopwatch; // from Guava

public class Playground {

    private static final int MAX = 100_000_000;

    public static void main(String... args) {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        Stopwatch stopwatch = Stopwatch.createStarted();
        for (int i = 0; i < MAX; i++) {
            task.run();
        }
        System.out.println(stopwatch);
    }

}

This currently prints the following on my Intel MBP on Temurin 17:

3.675 ms
1.948 ms
216.9 ms
243.3 ms

Notice the 100× slowdown for the third (and every subsequent) execution. Now, obviously, this is NOT how to write benchmarks in Java. The loop body doesn't do anything, so I'd expect it to be eliminated for any and all repetitions. Also, I could not reproduce this effect using JMH, which tells me the reason is tricky and fragile.

So, why does this happen? Why would there suddenly be such a catastrophic slowdown, and what's going on under the hood? One assumption is that C2 bails on us, but which limitation are we bumping into?

Things that don't change the behavior:

  • using anonymous inner classes instead of lambdas,
  • using 3+ different nested classes instead of lambdas.

Things that "fix" the behavior (the third and all subsequent invocations become much faster, hinting that compilation correctly eliminated the loops completely):

  • using 1-2 nested classes instead of lambdas,
  • using 1-2 lambda instances instead of 4 different ones,
  • not calling task.run() lambdas inside the loop,
  • inlining the execute() method, still maintaining 4 different lambdas.
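
For example, the "1-2 lambda instances" fix can be sketched like this (the class name PlaygroundSingleLambda and the System.nanoTime() timing are my own stand-ins, replacing the Guava Stopwatch from the original):

```java
// Sketch of the "1-2 lambda instances" fix: a single Runnable instance
// means the call site inside execute() only ever sees one class, so it
// stays monomorphic and all four runs remain fast.
public class PlaygroundSingleLambda {

    private static final int MAX = 100_000_000;

    public static void main(String... args) {
        Runnable task = () -> {};   // one instance of one class
        for (int r = 0; r < 4; r++) {
            System.out.printf("%.3f ms%n", execute(task) / 1e6);
        }
    }

    // Returns elapsed nanoseconds instead of printing a Stopwatch.
    public static long execute(Runnable task) {
        long start = System.nanoTime();
        for (int i = 0; i < MAX; i++) {
            task.run();
        }
        return System.nanoTime() - start;
    }
}
```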
Petr Janeček
Jakes
    Okay, so: Writing a correct microbenchmark in Java by hand is **hard**: https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java. Yours is obviously flawed. Use [JMH](https://github.com/openjdk/jmh). That said, the result you're getting is interesting on its own and I'd like to understand the reason for the behaviour. – Petr Janeček Jan 14 '22 at 15:44
    If A1/B/A2 are switched around, the third one always seems to take a lot longer than the first – DuncG Jan 14 '22 at 15:46
  • @DuncG yes, that is another weird property of this experiment – Jakes Jan 14 '22 at 15:52
  • @PetrJaneček, this is an oversimplified class to **clearly** show the weird results happening here, and not a final test template to be used as a benchmark model. The ones I use are way more complex using different heuristics – Jakes Jan 14 '22 at 15:55
  • This is perfectly normal on the JVM, and the reason you use benchmarking tools. – Louis Wasserman Jan 14 '22 at 16:06
  • @LouisWasserman Normally I'd fully agree. Even in this case it's not hard to convince the compiler to completely eliminate all code in `expectedResults()` and make it finish in a millisecond. That's understandable. But still, just for our JVM understanding, how does it happen that a loop doing nothing suddenly slows down 10×, and continues to be super-slow even if we continue calling it? What tool can we use to isolate / understand the behaviour? – Petr Janeček Jan 14 '22 at 16:31
  • @PetrJaneček, like you said (and that was the reason I posted the code here) when escalating it one can confirm that it will go on forever with a higher (but stable) time values even though the first 2 are always lower. This is even more noticeable when using higher loop ranges – Jakes Jan 14 '22 at 16:33
  • @Jakes I took the liberty to simplify your code to its minimal form, and rewrote the question to specifically ask for the reason of the single effect we're looking at. Please feel free to revert / edit my text to your liking. I simply tried to focus the question on the subject matter. – Petr Janeček Jan 15 '22 at 00:50
  • The explanation is probably related to JIT compilation. And my guess is that your "unrolling" of the loop in `main` is implicated too. That means you have 4 distinct lambdas instead of one. Each one will be bytecode compiled to a distinct .class file, so the JIT compiler (probably) won't compile them all at the same time. – Stephen C Jan 15 '22 at 01:18
  • The various tweaks that make the "effect" go away will all affect the JIT compiler in one way or another. But the bottom line is that you shouldn't see this thing in *typical* (i.e. long running) Java applications because it would get all of the JIT compilation done early. (And if not, use a JVM that supports AOT compilation ...) – Stephen C Jan 15 '22 at 01:25
  • If you really want to see what is happening, turn on JIT compiler logging. And if necessary, examine the native code it emits. But I suspect the "when" the JIT compiler runs will reveal the clues you need to understand. – Stephen C Jan 15 '22 at 01:30
  • @PetrJaneček well this is a bit odd and a strange way to approach my initial question, I believe. My question was abstract to lambdas or any form of expression and was regarding the multiple execution of the same line of code. But if this is the normal way to go, then so be it – Jakes Jan 15 '22 at 23:25

1 Answer


You can actually replicate this with JMH SingleShot mode:

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.SingleShotTime)
@Warmup(iterations = 0)
@Measurement(iterations = 1)
@Fork(1)
public class Lambdas {

    @Benchmark
    public static void doOne() {
        execute(() -> {});
    }

    @Benchmark
    public static void doFour() {
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
        execute(() -> {});
    }

    public static void execute(Runnable task) {
        for (int i = 0; i < 100_000_000; i++) {
            task.run();
        }
    }
}
Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.446           s/op
Lambdas.doOne          ss       0.006           s/op

If you look at the -prof perfasm profile for the doFour test, you get a fat clue:

....[Hottest Methods (after inlining)]..............................................................
 32.19%         c2, level 4  org.openjdk.Lambdas$$Lambda$44.0x0000000800c258b8::run, version 664 
 26.16%         c2, level 4  org.openjdk.Lambdas$$Lambda$43.0x0000000800c25698::run, version 658 

There are at least two hot lambdas, and those are represented by different classes. So what you are seeing is likely a monomorphic (one target), then bimorphic (two targets), then polymorphic virtual call at task.run().

A virtual call has to choose which class to call the implementation from. The more classes a call site sees, the worse it gets for the optimizer. The JVM tries to adapt, but the call site degrades as the run progresses. Roughly like this:

execute(() -> {}); // compiles with single target, fast
execute(() -> {}); // recompiles with two targets, a bit slower
execute(() -> {}); // recompiles with three targets, slow
execute(() -> {}); // continues to be slow

Now, eliminating the loop requires seeing through task.run(). In the monomorphic and bimorphic cases it is easy: one or both targets are inlined, their empty bodies are discovered, done. In both cases you still have to do typechecks, which means it is not completely free, with bimorphic costing a bit extra. Once you experience a polymorphic call, there is no such luck at all: it is an opaque call.

You can add two more benchmarks in the mix to see it:

    @Benchmark
    public static void doFour_Same() {
        Runnable l = () -> {};
        execute(l);
        execute(l);
        execute(l);
        execute(l);
    }

    @Benchmark
    public static void doFour_Pair() {
        Runnable l1 = () -> {};
        Runnable l2 = () -> {};
        execute(l1);
        execute(l1);
        execute(l2);
        execute(l2);
    }

Which would then yield:

Benchmark            Mode  Cnt  Score   Error  Units
Lambdas.doFour         ss       0.445           s/op ; polymorphic
Lambdas.doFour_Pair    ss       0.016           s/op ; bimorphic
Lambdas.doFour_Same    ss       0.008           s/op ; monomorphic
Lambdas.doOne          ss       0.006           s/op

This also explains why your "fixes" help:

using 1-2 nested classes instead of lambdas,

Bimorphic inlining.

using 1-2 lambda instances instead of 4 different ones,

Bimorphic inlining.

not calling task.run() lambdas inside the loop,

Avoids the polymorphic (opaque) call in the loop, which allows loop elimination.

inlining the execute() method, still maintaining 4 different lambdas.

Avoids a single call site that experiences multiple call targets. In other words, turns a single polymorphic call site into a series of monomorphic call sites each with its own target.
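
As a sketch of that last fix (the class name PlaygroundInlined and the System.nanoTime() timing are my own choices, not from the original), manually inlining execute() looks roughly like this:

```java
// Sketch of the "inline execute() by hand" fix: each lambda gets its own
// loop, i.e. its own call site, and each call site only ever sees one
// Runnable class, so every one of them stays monomorphic and inlinable.
public class PlaygroundInlined {

    private static final int MAX = 100_000_000;

    public static double[] run() {
        Runnable t1 = () -> {}, t2 = () -> {}, t3 = () -> {}, t4 = () -> {};
        double[] ms = new double[4];
        long s;

        s = System.nanoTime();
        for (int i = 0; i < MAX; i++) t1.run(); // call site #1: only ever sees t1's class
        ms[0] = (System.nanoTime() - s) / 1e6;

        s = System.nanoTime();
        for (int i = 0; i < MAX; i++) t2.run(); // call site #2: monomorphic as well
        ms[1] = (System.nanoTime() - s) / 1e6;

        s = System.nanoTime();
        for (int i = 0; i < MAX; i++) t3.run(); // call site #3
        ms[2] = (System.nanoTime() - s) / 1e6;

        s = System.nanoTime();
        for (int i = 0; i < MAX; i++) t4.run(); // call site #4
        ms[3] = (System.nanoTime() - s) / 1e6;

        return ms;
    }

    public static void main(String... args) {
        for (double m : run()) {
            System.out.printf("%.3f ms%n", m);
        }
    }
}
```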

Aleksey Shipilev
  • You, sir, are a scholar and a gentleman! Thank you, I hoped to summon you :-). Makes complete sense now. Somehow at three in the morning I did not see that... – Petr Janeček Jan 15 '22 at 09:28
  • What tripped me, I think, is that it looks like four separate call sites, each with only a single concrete class given to it. But that's not how it works, is it? I need to re-read https://shipilev.net/blog/2015/black-magic-method-dispatch/, and invite everyone seeing this to do the same. – Petr Janeček Jan 15 '22 at 10:26