Here is a new attempt to present that curious difference in performance in the use of IntStream and reduce to do a computation on an integer array.
My question wasn't understood as legitimate. I was asked to do proper benchmark using JMH. So here it is.
In the following program, 2 methods do almost exactly the same processing. In the first version It does 3 identical computations 2 times, in the second version It does 2 identical computation 3 times.
In effect, both versions repeat the same computation 6 times.
package perf;
import org.openjdk.jmh.annotations.*;
import java.util.Random;
import java.util.stream.IntStream;
public class IntStreamBenchmark {
final static int ARRAY_LENGTH = 10_000_000;
@State(Scope.Benchmark)
public static class TestBuddy {
int[] as;
int xor;
@Setup(Level.Trial)
public void init(){
as = new int[ARRAY_LENGTH];
xor = 0;
Random random = new Random(0);
for (int i = 0; i < ARRAY_LENGTH; i++) {
as[i] = random.nextInt(1_000_000);
xor ^= as[i];
}
}
public int[] getArray(){
return as;
}
public boolean checkResult(int result) {
return result == xor;
}
}
/** Run 2 reduce expresssions 3 times each */
@Benchmark
@BenchmarkMode(Mode.Throughput)
@Fork(2)
public void benchmark_2_3(TestBuddy testBuddy) {
// repeat 3 times
for (int k = 0; k < 3; k++) {
int[] as = testBuddy.getArray();
int result;
// compute 2 stream reduce expressions
result = IntStream.of(as).reduce(0, (x, y) -> x^y);
result |= IntStream.of(as).reduce(0, (x, y) -> x^y);
if( !testBuddy.checkResult(result) ){
throw new RuntimeException("Calculation error.");
}
}
}
/** Run 3 reduce expresssions 2 times each */
@Benchmark
@BenchmarkMode(Mode.Throughput)
@Fork(2)
public void benchmark_3_2(TestBuddy testBuddy) {
// repeat 2 times
for (int k = 0; k < 2; k++) {
int[] as = testBuddy.getArray();
int result;
// compute 3 stream reduce expressions
result = IntStream.of(as).reduce(0, (x, y) -> x^y);
result |= IntStream.of(as).reduce(0, (x, y) -> x^y);
result |= IntStream.of(as).reduce(0, (x, y) -> x^y);
if( !testBuddy.checkResult(result) ){
throw new RuntimeException("Calculation error.");
}
}
}
}
And here is the verdict from JMH:
# Run complete. Total time: 00:06:51
REMEMBER: The numbers below are just data. To gain reusable insights, you need to follow up on
why the numbers are the way they are. Use profilers (see -prof, -lprof), design factorial
experiments, perform baseline and negative tests that provide experimental control, make sure
the benchmarking environment is safe on JVM/OS/HW level, ask for reviews from the domain experts.
Do not assume the numbers tell you what you want them to tell.
Benchmark Mode Cnt Score Error Units
IntStreamBenchmark.benchmark_2_3 thrpt 10 16.010 ± 0.196 ops/s
IntStreamBenchmark.benchmark_3_2 thrpt 10 1.305 ± 0.279 ops/s
You immediately see that the throughput went from 16 to 1.3. That is huge. I don't see an obvious reason for that.
My question is: what is the reason, in the working of hotspot or anywhere, that can create such a difference?
About the original program
Below you can find the program I posted initially. I was told it is a microbenchmark and that it was doing it "wrong". I don't agree with that.
That program is the last in a series of tests I did to understand the problem and characterize the circunstances when the effect happens.
Initilaly, I did fill the array with non-zero data, I did proper computations, I did check the result to make sure that it wasn't optimized out, I did run the code for a prolonged time to see if there is a warmup effect. All these tests convinced me that there is something wrong in the way the JVM optimizes this code.
Then, in an effort to identify the cause of the problem, I stripped the program down to its bare minimum. I removed all code that I could remove and still observe the drop in performance I had identified. I understand the resulting code is suspicious as a benchmark. But it was not intended to prove the effect, it was there to show the circumstances when the problem occurs.
I also wanted people to try the code on different machines, to see whether it is related to some peculiarity of my hardware, or whether it is more systematic. So far I have seen it happen everywhere I tried, on 2 computers and 4 versions of the JVM, up to openjdk 18.
Looking at he output of the original code, which is not visible in the benchmark above, you see more precisely what happens. One statement first runs perfectly well, in ~150 ms, but later consistently needs over 1000 ms to execute. I can imagine that hotspot identifies some repetition in the code and suddenly decides to optimize some code out. But I cannot see any explanation why hotspot would decide to switch to a more laborious way to execute the code. That would be a bug.
Original post
I was doing a benchmark on various methods to compute the max of an integer array. Some methods involved streams. But I couldn't get any meaningful result because the run times can change a lot for no logical reason.
Here is a stripped down version of my program.
package perf;
import java.util.Arrays;
import java.util.Random;
import java.util.function.Supplier;
import java.util.stream.IntStream;
public class MaxOfInts3 {
public static void measure(String name, Supplier<Integer> computation, int samples){
System.gc();
for( int i = 0 ; i < samples ; i++ ){
long t0 = System.currentTimeMillis();
int result = computation.get();
long t1 = System.currentTimeMillis();
long time = t1 - t0;
System.out.format("%-20s %d: %d, %d ms\n", name, i, result, time);
}
System.out.println();
}
public static void benchmark(int len){
System.out.println("Create array of size " + len + "...");
final int[] as = new int[len];
Random rnd = new Random();
for( int i = 0 ; i < len ; i++ ) {
as[i] = rnd.nextInt(1_000_000);
}
// final int[] bs = new int[len];
// Supplier<Integer> comp0 = () -> IntStream.of(bs).reduce((x,y) -> 0).orElse(0);
// Supplier<Integer> comp1 = () -> Arrays.stream(bs).reduce((x,y) -> 1).orElse(1);
// comp0.get();
// comp1.get();
measure("Compute max", () -> Arrays.stream(as).reduce((x,y) -> x > y ? x : y).getAsInt(), 3);
measure("Compute min", () -> Arrays.stream(as).reduce((x,y) -> x < y ? x : y).getAsInt(), 3);
measure("Compute max again", () -> Arrays.stream(as).reduce((x,y) -> x > y ? x : y).getAsInt(), 3);
measure("Compute min again", () -> Arrays.stream(as).reduce((x,y) -> x < y ? x : y).getAsInt(), 3);
}
public static void main(String[] args) {
benchmark(300_000_000);
}
}
When I run the program as it is, I get the following output.
Create array of size 300000000...
Compute max 0: 999999, 173 ms
Compute max 1: 999999, 143 ms
Compute max 2: 999999, 124 ms
Compute min 0: 0, 176 ms
Compute min 1: 0, 156 ms
Compute min 2: 0, 125 ms
Compute max again 0: 999999, 1013 ms
Compute max again 1: 999999, 1232 ms
Compute max again 2: 999999, 1139 ms
Compute min again 0: 0, 1283 ms
Compute min again 1: 0, 1165 ms
Compute min again 2: 0, 1066 ms
As you can see, the same method called twice returns very different times between the first and the second call.
Further investigation shows that it doesn't matter which computations I run, the first 2 runs are fast, the subsequent runs are slow. Interestingly, I can call my Supplier many times, it doesn't change the result.
It seems to depend on the number of times I stream an integer array inside of a Supplier. If I define and use more than two Suppliers that work by streaming and reducing an integer array then everything becomes slow. For instance, If I uncomment the commented lines, all 4 benchmarks are slow.
For reference, I tried it on a MacBook Pro with Java 13.0.3 and on Windows 10 with Java 11.0.2. In both cases I see the same thing.
Do you people also see this behaviour? And is there a good explanation why this happens?
PS: Here is a different version of the benchmark method that is maybe more to the point.
public static void benchmark(int len){
final int[] as = new int[len];
Supplier<Integer> comp0 = () -> IntStream.of(as).reduce((x,y) -> 0).orElse(0);
Supplier<Integer> comp1 = () -> IntStream.of(as).reduce((x,y) -> 1).orElse(1);
Supplier<Integer> comp2 = () -> IntStream.of(as).reduce((x,y) -> 2).orElse(2);
measure("comp0", comp0, 3);
measure("comp1", comp1, 3);
measure("comp0", comp0, 3);
measure("comp1", comp1, 3);
measure("comp2", comp2, 3);
measure("comp0", comp0, 3);
measure("comp1", comp1, 3);
measure("comp2", comp2, 3);
}