Trying to benchmark lambda performance

Question

I've read the post "Performance difference between Java 8 lambdas and anonymous inner classes" and the article linked from it, which says:

Lambda invocation behaves exactly as anonymous class invocation

"OK," I said, and decided to write my own benchmark using JMH. Here it is (I've also added a benchmark for a method reference):

import org.openjdk.jmh.annotations.Benchmark;

public class MyBenchmark {

    public static final int TESTS_COUNT = 100_000_000;

    @Benchmark
    public void testMethod_lambda() {
        X x = i -> test(i);
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }
    @Benchmark
    public void testMethod_methodRefernce() {
        X x = this::test;
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }
    @Benchmark
    public void testMethod_anonymous() {
        X x = new X() {
            @Override
            public void x(Long i) {
                test(i);
            }
        };
        for (long i = 0; i < TESTS_COUNT; i++) {
            x.x(i);
        }
    }

    interface X {
        void x(Long i);
    }

    public void test(Long i) {
        if (i == null) System.out.println("never");
    }
}

And the results (on Intel Core i7 4770k) are:

Benchmark                                     Mode  Samples   Score  Score error  Units
t.j.MyBenchmark.testMethod_anonymous         thrpt      200  16,160        0,044  ops/s
t.j.MyBenchmark.testMethod_lambda            thrpt      200   4,102        0,029  ops/s
t.j.MyBenchmark.testMethod_methodRefernce    thrpt      200   4,149        0,022  ops/s

So, as you can see, there is a 4x difference between lambda and anonymous class invocation: the lambda is 4x slower.

The question is: what am I doing wrong, or have I misunderstood how lambdas are supposed to perform?

EDIT:

# VM invoker: C:\Program Files\Java\jre1.8.0_31\bin\java.exe
# VM options: <none>
# Warmup: 20 iterations, 1 s each
# Measurement: 20 iterations, 1 s each

Looking at the byte code generated, I notice that the anonymous class uses `invokespecial` to construct the object, while the lambda uses `invokedynamic`. I'd be curious to see if there's any speed difference if you construct one instance of each object outside of the test itself. — resueman, Jan 15 '16 at 18:54
why are you running a loop inside a JMH test? JMH does the looping for you. and why are you using that *"never"* thing instead of blackhole? the compiler is smart enough to optimize it away. — the8472, Jan 15 '16 at 18:58
also, the "*Lambda invocation behaves exactly as anonymous class invocation*" thing you quote.. you conveniently left out the footnote. — the8472, Jan 15 '16 at 19:00
Note that if you move the creation of the lambda, method reference and anonymous class into private static fields (and reuse them), your benchmark shows that the method reference and the anonymous class have the same performance, while the lambda is less performant. — Tunaki, Jan 15 '16 at 19:06
*"What am I doing wrong?"* - 1) Writing a benchmark loop manually is a mistake; JMH does it for you. 2) Results are not consumed by the Blackhole => JVM may oddly optimize the code, and you'll measure not what you expect to. — apangin, Jan 15 '16 at 19:06
@Tunaki yes, I know, it is obvious - in this case we save the time spent creating the instances. But how often do you do such things (lambdas as static fields) in real life? — Andremoniy, Jan 15 '16 at 19:07
@Andremoniy: "current implementation" means subject to change. You cannot assume it still is like it was when those slides were created. In other words, you cannot assume that the statement you quoted is true. And considering that the quoted statement is the starting point of your question that is a pretty big omission. — the8472, Jan 15 '16 at 19:58

score 9 · Accepted Answer · answered Jan 15 '16 at 19:18

The problem is in your benchmark: you are the victim of dead code elimination.

The JIT compiler is sometimes smart enough to understand that the result of autoboxing is never null, so for the anonymous class it simply removed your check, which in turn made the loop body almost empty. Replace it with something less obvious (for the JIT), like this:

public void test(Long i) {
    if (i == Long.MAX_VALUE) System.out.println("never");
}

And you will observe the same performance (the anonymous class becomes slower, while the lambda and the method reference perform at the same level as before).

For the lambda and the method reference, the JIT did not make the same optimization for some reason. But you should not worry: it's unlikely that real code will contain such a method, one that can be optimized out completely.

In general @apangin is right: use Blackhole instead.
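
To illustrate the point about boxing, here is a minimal plain-Java sketch (not a JMH benchmark). The class `BoxingDeadCodeSketch` and its methods are hypothetical names introduced for illustration, and the `sink` field is only a crude stand-in for JMH's `Blackhole`, which remains the better choice inside a real benchmark.

```java
class BoxingDeadCodeSketch {

    // Crude stand-in for a Blackhole (assumption, not JMH): storing the result
    // in a volatile field makes it observable outside the loop.
    static volatile long sink;

    // Same shape as the null check in test(Long) from the question.
    static boolean isNull(Long i) {
        return i == null;
    }

    // Counts how often the null branch is taken when the argument comes from autoboxing.
    static int countNullHits(int iterations) {
        int hits = 0;
        for (long i = 0; i < iterations; i++) {
            // i is boxed here; a boxed primitive is never null, so the branch
            // can never be taken and the whole check is a candidate for removal.
            if (isNull(i)) hits++;
        }
        return hits;
    }

    // Variant whose work cannot simply be thrown away: the values are actually used.
    static long consumeAll(int iterations) {
        long sum = 0;
        for (long i = 0; i < iterations; i++) {
            Long boxed = i;
            sum += boxed;
        }
        sink = sum;
        return sum;
    }

    public static void main(String[] args) {
        System.out.println(countNullHits(1_000)); // prints 0
        System.out.println(consumeAll(1_000));    // prints 499500
    }
}
```

The first loop has no observable effect at all, which is exactly the situation in which a JIT is free to drop the body; the second one publishes a result, so at least the computation of that result has to happen.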

Tagir Valeev

Yep, I just tested and this is indeed the problem. In general @Andremoniy, you should not try to be smart in your benchmark, JIT or the compiler will be smarter. – Tunaki Jan 15 '16 at 19:22
@TagirValeev yep, looks like I was really mistaken. Thank you very much for the lesson – Andremoniy Jan 15 '16 at 20:06
I’m really surprised that `i == Long.MAX_VALUE` is something “less obvious”, enough to fool a JVM. After all, we have a perfectly predictable loop counter and a boxing, an operation known to the JVM… – Holger Jan 18 '16 at 10:50

Brian Goetz · Answer 2 · 2016-01-15T20:28:36.003

In addition to the issues raised by @TagirValeev, the benchmark approach you are taking is fundamentally flawed, because you are measuring a composite metric (despite your attempts not to.)

The significant costs you want to measure independently are linkage, capture, and invocation. But all your tests smear together some amount of each, poisoning your results. My advice would be to focus only on invocation cost -- this is the most relevant to overall application throughput, and also the easiest to measure (because it is less influenced by caching at multiple levels.)

Bottom line: measuring performance in dynamically compiled environments is really, really hard. Even with JMH.
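
One way to isolate invocation, sketched here in plain Java rather than as a JMH benchmark: create each `X` once, outside the measured code, so that linkage and capture are paid up front and the measured methods contain nothing but the call. The class `InvocationOnlySketch`, the `calls` counter and the `invoke*` methods are hypothetical names for illustration; in JMH the fields would presumably live in a `@State` object and the `invoke*` methods would be the `@Benchmark` methods.

```java
class InvocationOnlySketch {

    interface X {
        void x(Long i);
    }

    // Observable side effect so the calls are not trivially dead (illustrative only).
    long calls;

    // Still an instance method, as in the question.
    void test(Long i) {
        calls++;
    }

    // Linkage and capture happen here, once per InvocationOnlySketch instance,
    // not on every measured call.
    final X lambda = i -> test(i);
    final X methodRef = this::test;
    final X anonymous = new X() {
        @Override
        public void x(Long i) {
            test(i);
        }
    };

    // Each method below performs invocation only.
    void invokeLambda(long i) {
        lambda.x(i);
    }

    void invokeMethodRef(long i) {
        methodRef.x(i);
    }

    void invokeAnonymous(long i) {
        anonymous.x(i);
    }

    public static void main(String[] args) {
        InvocationOnlySketch s = new InvocationOnlySketch();
        s.invokeLambda(1);
        s.invokeMethodRef(2);
        s.invokeAnonymous(3);
        System.out.println(s.calls); // prints 3
    }
}
```

Measuring creation separately would then be a different experiment with its own methods, rather than something mixed into these three.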

score 0 · Answer 3 · edited Mar 27 '16 at 09:59

My question is another example of how you shouldn't do benchmarking. I've rewritten my test according to the advice in the other answers here.

I hope it is now closer to correct: it shows that there isn't any significant difference between lambda and anonymous class invocation performance. See it below:

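import org.openjdk.jmh.annotations.Benchmark;
import org.openjdk.jmh.annotations.Param;
import org.openjdk.jmh.annotations.Scope;
import org.openjdk.jmh.annotations.State;
import org.openjdk.jmh.infra.Blackhole;

@State(Scope.Benchmark)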
public class MyBenchmark {

    @Param({"1", "100000", "500000"})
    public int arg;

    @Benchmark
    public void testMethod_lambda(Blackhole bh) {
        X x = (i, bh2) -> test(i, bh2);
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_methodRefernce(Blackhole bh) {
        X x = this::test;
        x.x(arg, bh);
    }

    @Benchmark
    public void testMethod_anonymous(Blackhole bh) {
        X x = new X() {
            @Override
            public void x(Integer i, Blackhole bh) {
                test(i, bh);
            }
        };
        x.x(arg, bh);
    }

    interface X {
        void x(Integer i, Blackhole bh);
    }

    public void test(Integer i, Blackhole bh) {
        bh.consume(i);
    }
}

Benchmark                                     (arg)   Mode  Samples          Score  Score error  Units
t.j.MyBenchmark.testMethod_anonymous              1  thrpt      200  415893575,928  1353627,574  ops/s
t.j.MyBenchmark.testMethod_anonymous         100000  thrpt      200  394989882,972  1429490,555  ops/s
t.j.MyBenchmark.testMethod_anonymous         500000  thrpt      200  395707755,557  1325623,340  ops/s
t.j.MyBenchmark.testMethod_lambda                 1  thrpt      200  418597958,944  1098137,844  ops/s
t.j.MyBenchmark.testMethod_lambda            100000  thrpt      200  394672254,859  1593253,378  ops/s
t.j.MyBenchmark.testMethod_lambda            500000  thrpt      200  394407399,819  1373366,572  ops/s
t.j.MyBenchmark.testMethod_methodRefernce         1  thrpt      200  417249323,668  1140804,969  ops/s
t.j.MyBenchmark.testMethod_methodRefernce    100000  thrpt      200  396783159,253  1458935,363  ops/s
t.j.MyBenchmark.testMethod_methodRefernce    500000  thrpt      200  395098696,491  1682126,737  ops/s
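
As a rough check of the "no significant difference" reading, the relative spread between the three variants can be computed from the scores above. This is a sketch, not a statistical test: `ScoreSpreadSketch` and `relativeSpread` are hypothetical names, and the numbers are copied from the table with the decimal comma written as a dot.

```java
class ScoreSpreadSketch {

    // Relative spread of a set of throughput scores: (max - min) / min.
    static double relativeSpread(double... scores) {
        double min = Double.POSITIVE_INFINITY;
        double max = Double.NEGATIVE_INFINITY;
        for (double s : scores) {
            min = Math.min(min, s);
            max = Math.max(max, s);
        }
        return (max - min) / min;
    }

    public static void main(String[] args) {
        // Order in each call: anonymous, lambda, method reference.
        double arg1 = relativeSpread(415893575.928, 418597958.944, 417249323.668);
        double arg100000 = relativeSpread(394989882.972, 394672254.859, 396783159.253);
        double arg500000 = relativeSpread(395707755.557, 394407399.819, 395098696.491);

        System.out.printf("arg=1:      %.2f%%%n", arg1 * 100);
        System.out.printf("arg=100000: %.2f%%%n", arg100000 * 100);
        System.out.printf("arg=500000: %.2f%%%n", arg500000 * 100);
    }
}
```

For each value of `arg` the spread comes out below 1%, whereas the scores in the question (16,160 vs 4,102 ops/s) differ by roughly a factor of four.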

edited Mar 27 '16 at 09:59

halfer

answered Jan 15 '16 at 20:53

Andremoniy

I think you still want to factor out the creation of the IC/lambda into static code. Otherwise you're still poisoning some invocation measurements with linking/capture costs. – Brian Goetz Jan 15 '16 at 21:44
@BrianGoetz well, I just want to show that I was wrong in my previous conclusions and that a correct benchmark shows there isn't any difference in invocation performance. Where am I wrong in this case? – Andremoniy Jan 15 '16 at 21:46
You're measuring the sum of the cost of two different operations which have independent performance profiles. Like measuring the cost of "go to the store and buy milk" and "sit down at the kitchen table and drink a glass of milk". Because, your common operation is not "buy milk + drink milk", it's "drink milk". It just so happens that you have to buy some first before you can drink it. But clearly, the milk-buying is influenced by a lot of things (time of day, traffic patterns, what hours the store is open, etc) that have nothing to do with drinking it. You want to measure each separately. – Brian Goetz Jan 15 '16 at 21:49
OK, I understand what you mean. But I've tried to measure the commonly used case (linkage + invocation). It's hard to imagine that somebody will create static fields for such things. In the common case `Intellij Idea`, for example, always suggests replacing an anonymous class with an inline lambda where possible, so I decided to measure that. – Andremoniy Jan 15 '16 at 22:01
I think you're still not getting it. Linkage is (a) sooooo much more expensive than invocation, (b) only done once, and (c) subject to distortion by many factors. If you think the common case is link+invoke once, then almost by definition you don't care about performance! (Linkage for inner classes involves going to the file system, reading classfile bytes off disk, and parsing/verifying/loading the class.) It's like saying the common operation is "build the building, and then take the elevator to the 10th floor." Part of what makes performance analysis hard is knowing *what* to measure! – Brian Goetz Jan 15 '16 at 22:06
Ah, LOL, OK, I got it. But how can I change all the `x`-s to static code if I want to measure exactly an instance method (`void test(...)`) invocation, not a static one? – Andremoniy Jan 15 '16 at 22:09
You can still have `test` be an instance method; statically capture an instance of X, and just measure x.x(..) in your @Bench method. – Brian Goetz Jan 15 '16 at 22:48
@Brian Goetz: but the linkage cost will be caught by the warmup, if the setup includes at least one warmup invocation. Of course, it’s still the case that the costs of instantiation and invocation are merged here; on the other hand, the cost of invocation is so low that it is almost impossible to measure at all. Regardless of which approach you use to split the two costs, like storing the instances in static variables or factoring them out of a loop, it will likely give the JVM enough hints to elide most of the single invocation costs completely… – Holger Jan 18 '16 at 11:01
@Holger Yes, performance measurement is hard :) In addition to the many ways one can mess up the measurement, one can also mess up by not having clarity and judgment about _what to measure_. As the OP showed, just because JMH filters out the most obvious sources of measurement error, what is left can still be quite subtle. – Brian Goetz Jan 18 '16 at 15:32
