Even though the presented results are largely irrelevant, the observed effect is real: the Stream API does have an overhead which, for such simple tasks, cannot be completely eliminated in real applications even after warm-up. Let's write a JMH benchmark:
import java.util.concurrent.TimeUnit;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 500, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class IterativeSum {
    @Param({"100", "10000", "1000000"})
    private int n;

    public static long iterativeSum(long n) {
        long sum = 0;
        for (long i = 1; i <= n; i++) {
            sum += i;
        }
        return sum;
    }

    @Benchmark
    public long is() {
        return iterativeSum(n);
    }
}
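To actually run it, one option is JMH's Runner API (a minimal sketch, assuming the standard jmh-core dependency; the class name BenchmarkRunner is just an illustration, and a Maven-shaded uberjar works equally well):

import org.openjdk.jmh.runner.Runner;
import org.openjdk.jmh.runner.RunnerException;
import org.openjdk.jmh.runner.options.Options;
import org.openjdk.jmh.runner.options.OptionsBuilder;

public class BenchmarkRunner {
    public static void main(String[] args) throws RunnerException {
        // Runs every @Benchmark method in IterativeSum with the settings
        // declared in its annotations
        Options opt = new OptionsBuilder()
                .include(IterativeSum.class.getSimpleName())
                .build();
        new Runner(opt).run();
    }
}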
This is the baseline: a plain loop. The results on my box are the following:
Benchmark            (n)  Mode  Cnt    Score   Error  Units
IterativeSum.is      100  avgt   30    0.074 ± 0.001  us/op
IterativeSum.is    10000  avgt   30    6.361 ± 0.009  us/op
IterativeSum.is  1000000  avgt   30  688.527 ± 0.910  us/op
Here's your Stream API-based version of this iteration:
// needs: import java.util.stream.Stream;
public static long sequentialSumBoxed(long n) {
    // Stream<Long>: every element is boxed and unboxed
    return Stream.iterate(1L, i -> i + 1).limit(n)
                 .reduce(0L, (i, j) -> i + j);
}

@Benchmark
public long ssb() {
    return sequentialSumBoxed(n);
}
The results look like this:
Benchmark             (n)  Mode  Cnt     Score    Error  Units
IterativeSum.ssb      100  avgt   30     1.253 ±  0.084  us/op
IterativeSum.ssb    10000  avgt   30   134.959 ±  0.421  us/op
IterativeSum.ssb  1000000  avgt   30  9119.422 ± 22.817  us/op
Very disappointing: 13-21x slower. This version performs many boxing operations inside; that's why the primitive stream specializations were created. Let's check the non-boxed version:
// needs: import java.util.stream.LongStream;
public static long sequentialSum(long n) {
    // LongStream avoids boxing the elements
    return LongStream.iterate(1L, i -> i + 1).limit(n)
                     .reduce(0L, (i, j) -> i + j);
}

@Benchmark
public long ss() {
    return sequentialSum(n);
}
The results are the following:
Benchmark            (n)  Mode  Cnt     Score    Error  Units
IterativeSum.ss      100  avgt   30     0.661 ±  0.001  us/op
IterativeSum.ss    10000  avgt   30    67.498 ±  5.732  us/op
IterativeSum.ss  1000000  avgt   30  1982.687 ± 38.501  us/op
Much better now, but still 2.8-10x slower. An alternative would be to use a range:
public static long rangeSum(long n) {
    return LongStream.rangeClosed(1, n).sum();
}

@Benchmark
public long rs() {
    return rangeSum(n);
}
The results are the following:
Benchmark            (n)  Mode  Cnt     Score     Error  Units
IterativeSum.rs      100  avgt   30     0.316 ±   0.001  us/op
IterativeSum.rs    10000  avgt   30    28.646 ±   0.065  us/op
IterativeSum.rs  1000000  avgt   30  2158.962 ± 514.780  us/op
Now it's 3.1-4.5x slower. The cause of this slowness is that the Stream API builds a very long call chain which hits the MaxInlineLevel JVM limit, so by default it cannot be fully inlined. You can raise this limit, for example with -XX:MaxInlineLevel=20, and get the following results:
Benchmark            (n)  Mode  Cnt    Score    Error  Units
IterativeSum.rs      100  avgt   30    0.111 ±  0.001  us/op
IterativeSum.rs    10000  avgt   30    9.552 ±  0.017  us/op
IterativeSum.rs  1000000  avgt   30  729.935 ± 31.915  us/op
Much better: now it's only 1.05-1.5x slower.
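By the way, if you don't want to set the flag globally, JMH can pass it to the forked benchmark JVMs only, via the jvmArgsAppend element of its @Fork annotation; a sketch (the class name IterativeSumInlined is hypothetical):

// Applies -XX:MaxInlineLevel=20 only to the forked JVMs running these
// benchmarks, keeping the three forks used above
@Fork(value = 3, jvmArgsAppend = "-XX:MaxInlineLevel=20")
@State(Scope.Benchmark)
public class IterativeSumInlined {
    // ... same @Param field and @Benchmark methods as above ...
}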
The problem with this test is that the loop body of the iterative version is very simple, so it can be efficiently unrolled and vectorized by the JIT compiler, while doing the same with comparable efficiency for the more sophisticated Stream API code is much harder. However, in real applications you're unlikely to sum consecutive numbers in a loop (why not write n*(n+1)/2 instead?). With real problems, the Stream API overhead is much lower even with the default MaxInlineLevel setting.
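For completeness, here is that closed form (a sketch, not part of the benchmark; the even factor is divided first, so the intermediate product doesn't overflow before the result itself would):

// Gauss's formula for 1 + 2 + ... + n; exact as long as the sum fits in
// a long (n up to roughly 4.3e9), unlike the naive n * (n + 1) / 2 whose
// intermediate product overflows earlier
public static long closedFormSum(long n) {
    return (n % 2 == 0) ? (n / 2) * (n + 1) : n * ((n + 1) / 2);
}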