The very first call to the Stream API in your program is always quite slow, because the JVM has to load many auxiliary classes, generate many anonymous classes for lambdas and JIT-compile many methods. As a result, the very first Stream operation usually takes several tens of milliseconds. Subsequent calls are much faster and may take less than 1 us, depending on the exact stream operation. If you swap the parallel-stream test and the sequential-stream test, the sequential stream will be much faster. All the hard work is done by whichever test runs first.
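As a rough illustration of this warm-up effect, here is a minimal sketch that times the same tiny pipeline on its first call and again after many repetitions. The class name FirstCallDemo, the method timeOnce and the repetition count are made up for this example, and System.nanoTime timing like this is only indicative, not a proper benchmark.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the names here are illustrative only.
class FirstCallDemo {
    // Runs a tiny stream pipeline once and returns the elapsed time in nanoseconds.
    static long timeOnce(List<String> items) {
        long start = System.nanoTime();
        long count = items.stream().filter(s -> !s.isEmpty()).count();
        long elapsed = System.nanoTime() - start;
        // Use the result so the pipeline cannot be considered dead code.
        if (count != items.size()) {
            throw new IllegalStateException("unexpected count: " + count);
        }
        return elapsed;
    }

    public static void main(String[] args) {
        List<String> persons = Arrays.asList("AAA", "BBB", "CCC", "DDD");
        // The first call pays for class loading and lambda bootstrap.
        long first = timeOnce(persons);
        // Repeat many times (arbitrary count) so later calls run warmed-up code.
        long later = 0;
        for (int i = 0; i < 10_000; i++) {
            later = timeOnce(persons);
        }
        System.out.println("first call: " + first / 1000.0 + " us");
        System.out.println("later call: " + later / 1000.0 + " us");
    }
}
```

On a typical JVM the first number should come out far larger than the second, but the exact values depend on the machine and JVM version.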
Let's write a JMH benchmark to properly warm up your code and test all the cases independently:
    import java.util.concurrent.TimeUnit;
    import java.util.*;
    import java.util.stream.*;
    import org.openjdk.jmh.annotations.*;

    // 5 warm-up and 10 measured iterations of 1 second each, in 3 forked JVMs
    @Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
    @Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    @Fork(3)
    @State(Scope.Benchmark)
    public class StreamTest {
        List<String> persons;

        // Builds the 4-element test list before the benchmark runs
        @Setup
        public void setup() {
            persons = new ArrayList<String>();
            persons.add("AAA");
            persons.add("BBB");
            persons.add("CCC");
            persons.add("DDD");
        }

        // Plain for-each loop
        @Benchmark
        public void loop() {
            for (String person : persons)
                System.err.println(person);
        }

        // Sequential stream
        @Benchmark
        public void stream() {
            persons.stream().forEach(System.err::println);
        }

        // Parallel stream
        @Benchmark
        public void parallelStream() {
            persons.parallelStream().forEach(System.err::println);
        }
    }
Here we have three tests: loop, stream and parallelStream. Note that I changed System.out to System.err. That's because System.out is normally used to output the JMH results. I will redirect the output of System.err to nul, so the result should depend less on my filesystem or console subsystem (which is especially slow on Windows).
So the results are (Core i7-4702MQ CPU @ 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):
    Benchmark                  Mode  Cnt   Score   Error  Units
    StreamTest.loop            avgt   30  42.410 ± 1.833  us/op
    StreamTest.parallelStream  avgt   30  76.440 ± 2.073  us/op
    StreamTest.stream          avgt   30  42.820 ± 1.389  us/op
What we see is that stream and loop produce practically the same result: the difference is statistically insignificant. The Stream API is actually somewhat slower than a loop, but here the slowest part is the PrintStream. Even with output to nul, the IO subsystem is very slow compared to the other operations. So what we measured is not the speed of the Stream API or the loop, but the speed of println.
Also note that the units are microseconds, so the stream version actually runs about 1000 times faster than in your test.
Why is parallelStream much slower? Simply because you cannot parallelize writes to the same PrintStream, as it is internally synchronized. So parallelStream did all the hard work of splitting the 4-element list into 4 sub-tasks, scheduling the jobs on different threads and synchronizing them properly, but it's absolutely futile, since the slowest operation (println) cannot run in parallel: while one thread is working, the others are waiting. In general, it's useless to parallelize code that synchronizes on the same mutex (which is your case).
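To make the mutex argument concrete, here is a minimal sketch of a sink guarded by a single lock, standing in for the synchronized println. The class SynchronizedSinkDemo and its methods are hypothetical names invented for this illustration; this is not how PrintStream is implemented internally. The sink records how many threads are ever inside the locked section at the same time.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical stand-in for a synchronized sink such as PrintStream.println.
class SynchronizedSinkDemo {
    private final Object lock = new Object();
    private final AtomicInteger inside = new AtomicInteger();    // threads currently in the locked section
    private final AtomicInteger maxInside = new AtomicInteger(); // highest value of 'inside' ever seen
    private final AtomicInteger calls = new AtomicInteger();     // total number of accepted elements

    void accept(String s) {
        synchronized (lock) {
            int now = inside.incrementAndGet();
            maxInside.accumulateAndGet(now, Math::max);
            calls.incrementAndGet();
            inside.decrementAndGet();
        }
    }

    int maxConcurrent() { return maxInside.get(); }

    int calls() { return calls.get(); }

    public static void main(String[] args) {
        List<String> persons = Arrays.asList("AAA", "BBB", "CCC", "DDD");
        SynchronizedSinkDemo sink = new SynchronizedSinkDemo();
        // The stream may use several threads, but the lock admits only one at a time.
        persons.parallelStream().forEach(sink::accept);
        System.out.println("calls = " + sink.calls()
                + ", max threads inside lock = " + sink.maxConcurrent());
        // prints "calls = 4, max threads inside lock = 1"
    }
}
```

However many worker threads the parallel stream uses, the maximum observed inside the lock is always 1, so the guarded work runs serially while the splitting and scheduling overhead still has to be paid.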