
I was recently profiling my code and found an interesting bottleneck. Here is the benchmark:

import java.util.Arrays;
import java.util.concurrent.TimeUnit;
import java.util.stream.IntStream;

import org.openjdk.jmh.annotations.*;

@BenchmarkMode(Mode.Throughput)
@Fork(1)
@State(Scope.Thread)
@Warmup(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
@Measurement(iterations = 10, time = 1, timeUnit = TimeUnit.SECONDS)
public class Contains {

    private int[] ar = new int[] {1,2,3,4,5,6,7};

    private int val = 5;

    @Benchmark
    public boolean naive() {
        return contains(ar, val);
    }

    @Benchmark
    public boolean lambdaArrayStreamContains() {
        return Arrays.stream(ar).anyMatch(i -> i == val);
    }

    @Benchmark
    public boolean lambdaIntStreamContains() {
        return IntStream.of(ar).anyMatch(i -> i == val);
    }

    private static boolean contains(int[] ar, int value) {
        for (int arVal : ar) {
            if (arVal == value) {
                return true;
            }
        }
        return false;
    }

}

Result :

Benchmark                            Mode  Cnt       Score      Error  Units
Contains.lambdaArrayStreamContains  thrpt   10   22867.962 ± 1049.649  ops/s
Contains.lambdaIntStreamContains    thrpt   10   22983.800 ±  593.580  ops/s
Contains.naive                      thrpt   10  228002.406 ± 8591.186  ops/s

It shows that the array contains operation via lambda is 10 times slower than the naive implementation with a simple loop. I knew that lambdas should be a bit slower. But 10 times? Am I using lambdas incorrectly, or is this some issue with Java?

Dmitriy Dumanskiy
  • Just wondering: those benchmark annotations; is that some public framework; or some (really cool looking) invention of yours? – GhostCat Feb 21 '17 at 18:21
  • @GhostCat this is http://openjdk.java.net/projects/code-tools/jmh/ – Dmitriy Dumanskiy Feb 21 '17 at 18:22
  • You don't have a lot of data, so the constant overhead of the stream framework is going to dominate pretty much everything. This data is not at all surprising. If you had a lot more data it'd probably be more competitive. But as it stands, the stream implementation is going to do several allocations -- two in your code plus plenty more for the stream framework's implementation -- which are going to cost. – Louis Wasserman Feb 21 '17 at 18:26
  • @LouisWasserman yes, I don't have a lot of data, but in other cases lambdas seem to be only a few percent slower. Here, however, the difference is huge. All my arrays are really small (mostly 1 or 2 elements). It looks like some issue to me. – Dmitriy Dumanskiy Feb 21 '17 at 18:28
  • For small amounts of data like this, streams are going to be much slower. That's not a bug, and there's not really a way around that. The constant overhead of just setting up streams is going to dominate everything else. This isn't a bug in your code, it's not an issue in Java, it's just how it works. – Louis Wasserman Feb 21 '17 at 18:30
  • To augment @Louis Wasserman’s comment, you are creating at least an `IntStream` instance and an `IntPredicate` instance via a capturing lambda expression and ten warmup iterations may not be enough to let HotSpot elide the allocations costs. And, if I get it right, with a measured speed of `22867 ops/s` vs. `228002 ops/s`, we’re talking about `0.04 ms` vs `0.004 ms` for the complete operation… – Holger Feb 21 '17 at 18:36
  • I agree with the things said... And would suggest that you repeat with 10k entries instead of 10 :-) – GhostCat Feb 21 '17 at 20:05
  • @GhostCat what is the point of testing an unrealistic case? As I said, the majority of my arrays are small, with 1 or 2 numbers. This is real data, not something made up just for this benchmark. – Dmitriy Dumanskiy Feb 21 '17 at 20:07
  • How much warmup did you have? Also, people are throwing around streams and lambdas as if they were M&M's. The name should be telling: STREAMS. If it is not a stream, stick to simpler stuff. Lambdas are not for saving 3 lines of code. – bichito Feb 21 '17 at 20:12
  • How big is the “majority of your arrays”? You’re still talking about ten arrays total, like in your benchmark? Besides, in your benchmark, you are using the same array and the same search value, in a way that is recognizable by the JVM. Maybe the JVM is indeed better at detecting that the values are always the same when you use an array instead of a Stream, but what does that really say for your real life scenario? – Holger Feb 21 '17 at 20:14
  • The point is: when you are really concerned about performance... are you sure you are looking into the thing that matters the most? Or are you doing zillions of different computations on tons and tons of these small arrays? – GhostCat Feb 21 '17 at 20:15
  • @efekctive please have a look at the benchmark. It shows the number of warmup iterations. Streams mean data flow, and a data flow can be of any length. So your complaints are not clear to me. – Dmitriy Dumanskiy Feb 21 '17 at 20:22
  • @Holger the majority is 1 or 2 elements. However, there are also arrays with hundreds of elements. JMH eliminates such side effects, so this is a correct benchmark. – Dmitriy Dumanskiy Feb 21 '17 at 20:24
  • @GhostCat yes, I'm really concerned about performance. My production servers currently handle 10k req/sec, and I'm hosted on low-end hardware with only a few cores. – Dmitriy Dumanskiy Feb 21 '17 at 20:25
  • IMHO it is related to http://stackoverflow.com/q/11227809/247013 – Vit Khudenko Feb 21 '17 at 20:28
  • If your real data is small like this, and performance is critical, then yes, the correct conclusion to draw is that you should handwrite loops instead of using Streams. – Louis Wasserman Feb 21 '17 at 20:28
  • As @LouisWasserman is trying to tell you: the cost of the stream needs to be spread over a large number of items, otherwise it is just not worth it. It defeats the purpose for which they were created. BTW, which of the columns/rows is the warmup run? – bichito Feb 21 '17 at 20:30
  • @Arhimed I checked with an unsorted array. No difference. @efekctive I understand what Louis is saying. I didn't expect lambdas to perform super quickly. What I didn't expect is such a huge difference in performance. That's why I created this question, as it still doesn't seem normal to me. I may be wrong of course, and if this is totally ok, I'll avoid lambdas in all my code just in case. The warmup code is `@Warmup(iterations = 10)`. – Dmitriy Dumanskiy Feb 21 '17 at 20:36
  • Ok. It is not in the table but in the code. Found it – bichito Feb 21 '17 at 20:40
  • @Dmitriy Dumanskiy: no, JMH is not a magic tool that can undo any benchmark mistake a programmer can make. It would also be a contradiction to the goal, if all optimizations were disabled, as you want to measure the real life performance of an optimizing JVM. It is up to you to use constant values that are placeholders for variable parameters correctly. – Holger Feb 22 '17 at 09:51
  • @Holger do you see an error in my benchmark? If so, please point me to it. If not, what are you trying to say? Who said JMH is a magic tool? – Dmitriy Dumanskiy Feb 22 '17 at 12:03
  • This has been said [16 hours ago](http://stackoverflow.com/questions/42375003/why-lambda-intstream-anymatch-is-10-slower-than-naive-implementation?noredirect=1#comment71903186_42375003). You are using constants instead of variable values. You claimed that “JMH eliminate such side effects”, which would be magic, if JMH could do that. JMH does *offer* ways to avoid such errors, e.g. letting the tests run with changing parameters, however, you are not using them. – Holger Feb 22 '17 at 12:25
  • @Holger You are wrong. This is not a constant; this is a regular variable. This benchmark shows exactly how it would behave in real code. And `@State(Scope.Thread)` does this. No magic here. – Dmitriy Dumanskiy Feb 22 '17 at 12:39
  • This is a regular variable that has always the same value. Whether the JVM can detect this, can’t be determined without an actual test. All documentations and tutorials for JMH explicitly tell you that reading from an instance variable makes constant folding *harder*, but *not impossible*. And none of them ever claimed that “JMH eliminates such side effects”. It might be true that the particular JVM you used for this test didn’t fold this, but you are not doing yourself a favor answering a hint about a potential error source with that flat “JMH eliminates such side effects” claim. JMH doesn’t. – Holger Feb 22 '17 at 12:50
  • "All documentations and tutorials for JMH explicitly tell you that reading from an instance variable makes constant folding harder, but not impossible". Can you please provide some proof? Because here is what documentation says : "This can be prevented by always reading the inputs from non-final instance fields of @State objects, computing the result based on those values, and follow the rules to prevent DCE." - http://hg.openjdk.java.net/code-tools/jmh/file/ed0a5f40acfb/jmh-samples/src/main/java/org/openjdk/jmh/samples/JMHSample_10_ConstantFold.java – Dmitriy Dumanskiy Feb 22 '17 at 13:05
  • This isn’t documentation, but just a comment in an example. Not unfounded for a snapshot, but the JVM is a moving target. [This article](http://www.oracle.com/technetwork/articles/java/architect-benchmarking-2266277.html) uses the phrase “the virtual machine is *unlikely* to attempt constant folding optimizations”. [This article](http://java-performance.info/jmh/) states that even using `BlackHole` for the result “makes it much harder for JVM to optimize it (but not impossible!)”. You should always keep in mind that the obstacles of today are the trivial cases for tomorrow. Always *verify*. – Holger Feb 22 '17 at 13:21
  • Thanks. In fact, I did verify before posting. The results were all the same, so this is not the case. – Dmitriy Dumanskiy Feb 22 '17 at 13:36

1 Answer


Your benchmark does not actually measure anyMatch performance, but rather the stream overhead. This overhead can appear significant when compared to a very simple operation like a five-element array lookup.

The slowdown will not look so terrible if we switch from relative to absolute numbers. Let's measure latency instead of throughput for a clearer picture. I've omitted the lambdaIntStream benchmark since it works exactly the same way as lambdaArrayStream.

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  53,242 ± 2,034  ns/op
Contains.naive              avgt    5   5,876 ± 0,404  ns/op

5.8 ns is roughly 14 cycles of a 2.4 GHz CPU. The workload is so small that any extra cycle will be noticeable. So what is the overhead of stream operations?

Object allocation

Now rerun the benchmark with the -prof gc profiler. It will show the amount of heap allocated per operation:

Benchmark                                       Mode  Cnt     Score     Error   Units
Contains.lambdaArrayStream:·gc.alloc.rate.norm  avgt    5   152,000 ±   0,001    B/op
Contains.naive:·gc.alloc.rate.norm              avgt    5    ≈ 10⁻⁵              B/op

lambdaArrayStream allocates 152 bytes per iteration, while the naive benchmark allocates nothing. Of course, allocation is not free: at least 5 objects are constructed to support anyMatch, and each takes several nanoseconds:

  • Lambda i -> i == val
  • IntPipeline.Head
  • Spliterators.IntArraySpliterator
  • MatchOps.MatchOp
  • MatchOps.MatchSink

Call stack

The java.util.stream implementation is a bit complicated, since it must support all combinations of stream sources, intermediate operations and terminal operations. If you look at the call stack of anyMatch in your benchmark, you'll see something like this:

    at bench.Contains.lambda$lambdaArrayStream$0(Contains.java:24)
    at java.util.stream.MatchOps$2MatchSink.accept(MatchOps.java:119)
    at java.util.Spliterators$IntArraySpliterator.tryAdvance(Spliterators.java:1041)
    at java.util.stream.IntPipeline.forEachWithCancel(IntPipeline.java:162)
    at java.util.stream.AbstractPipeline.copyIntoWithCancel(AbstractPipeline.java:498)
    at java.util.stream.AbstractPipeline.copyInto(AbstractPipeline.java:485)
    at java.util.stream.AbstractPipeline.wrapAndCopyInto(AbstractPipeline.java:471)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:230)
    at java.util.stream.MatchOps$MatchOp.evaluateSequential(MatchOps.java:196)
    at java.util.stream.AbstractPipeline.evaluate(AbstractPipeline.java:234)
    at java.util.stream.IntPipeline.anyMatch(IntPipeline.java:477)
    at bench.Contains.lambdaArrayStream(Contains.java:23)
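
The cancellable traversal done by forEachWithCancel can be approximated by this hand-written sketch (simplified, not the actual JDK code; it shows why the traversal is a while-loop over tryAdvance rather than a counted for-loop):

```java
import java.util.Spliterator;
import java.util.Spliterators;

public class TraversalSketch {
    // Roughly what the stream traversal boils down to: pull one element at a
    // time via Spliterator.tryAdvance until either a match cancels the loop
    // or the spliterator is exhausted.
    static boolean anyMatchSketch(int[] ar, int val) {
        Spliterator.OfInt sp = Spliterators.spliterator(ar, 0);
        boolean[] found = {false};
        while (!found[0] && sp.tryAdvance((int i) -> {
            if (i == val) found[0] = true;
        })) {
            // keep pulling elements until a match cancels the loop
        }
        return found[0];
    }

    public static void main(String[] args) {
        int[] ar = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(anyMatchSketch(ar, 5)); // true
        System.out.println(anyMatchSketch(ar, 9)); // false
    }
}
```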

Not all of these method calls can be inlined. Furthermore, the JVM limits inlining to 9 levels, but here we see a deeper call stack. If we override the limit with -XX:MaxInlineLevel=20, the score becomes a bit better:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,294 ± 0,367  ns/op  (was 53,242)
Contains.naive              avgt    5   5,822 ± 0,207  ns/op

Loop optimizations

A for loop over an array is a trivial counted loop. The JVM can apply a wide range of loop optimizations here: loop peeling, loop unrolling, etc. This does not work for the while-style loop in the forEachWithCancel method, which is used to traverse an IntStream. The effect of loop optimizations can be measured with -XX:LoopUnrollLimit=0 -XX:-UseLoopPredicate:

Benchmark                   Mode  Cnt   Score   Error  Units
Contains.lambdaArrayStream  avgt    5  33,153 ± 0,559  ns/op
Contains.naive              avgt    5   9,853 ± 0,150  ns/op  (was 5,876)
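
For illustration, here is roughly what a 2-way manual unrolling of the naive loop looks like (a sketch of the kind of transformation the JIT can perform on a counted loop; not actual JIT output):

```java
public class UnrollSketch {
    // 2-way unrolled version of the naive contains: two comparisons per
    // loop iteration, plus a tail check for an odd leftover element.
    static boolean containsUnrolled(int[] ar, int value) {
        int i = 0;
        int limit = ar.length - 1;
        for (; i < limit; i += 2) {                 // handle elements in pairs
            if (ar[i] == value || ar[i + 1] == value) {
                return true;
            }
        }
        return i < ar.length && ar[i] == value;     // leftover element, if any
    }

    public static void main(String[] args) {
        int[] ar = {1, 2, 3, 4, 5, 6, 7};
        System.out.println(containsUnrolled(ar, 5)); // true
        System.out.println(containsUnrolled(ar, 9)); // false
    }
}
```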

Conclusions

There is some overhead to constructing and traversing a stream, but it is completely understood and cannot be considered a bug. I would not call the overhead big (even 50 ns/op is not that much); however, in this particular example the overhead dominates because of the extremely tiny workload.

apangin