import java.util.ArrayList;
import java.util.List;

public class IterationBenchmark {

    public static void main(String args[]){
        List<String> persons = new ArrayList<String>();
        persons.add("AAA");
        persons.add("BBB");
        persons.add("CCC");
        persons.add("DDD");
        long timeMillis = System.currentTimeMillis();
        for(String person : persons)
            System.out.println(person);
        System.out.println("Time taken for legacy for loop : "+
                  (System.currentTimeMillis() - timeMillis));
        timeMillis = System.currentTimeMillis();
        persons.stream().forEach(System.out::println);
        System.out.println("Time taken for sequence stream : "+
                  (System.currentTimeMillis() - timeMillis));
        timeMillis = System.currentTimeMillis();
        persons.parallelStream().forEach(System.out::println);
        System.out.println("Time taken for parallel stream : "+
                  (System.currentTimeMillis() - timeMillis));

    }
}

Output:

AAA
BBB
CCC
DDD
Time taken for legacy for loop : 0

AAA
BBB
CCC
DDD
Time taken for sequence stream : 49

CCC
DDD
AAA
BBB
Time taken for parallel stream : 3

Why is the Java 8 Stream API performance so low compared to the legacy for loop?

Tagir Valeev
Bala
    Your benchmark is invalid. Benchmarking Java code is not just comparing the start and the end time. Why would you care if your program works 0 ms or 49 ms? Can you even blink that fast? You should not care about this. And if you think that if you repeat such code several times it will produce the same performance, you're absolutely wrong. – Tagir Valeev Sep 01 '15 at 10:13
    [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) – Holger Sep 01 '15 at 10:19

1 Answer

The very first call to the Stream API in your program is always quite slow, because the JVM has to load many auxiliary classes, generate many anonymous classes for lambdas, and JIT-compile many methods. As a result, the very first Stream operation usually takes several dozen milliseconds. Subsequent calls are much faster and may drop below 1 us, depending on the exact stream operation. If you swap the parallel-stream test and the sequential-stream test, the sequential stream will be much faster. All the hard work is paid for by whichever test runs first.
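The first-call effect can be observed with a rough sketch like the one below. It is not a proper benchmark (JMH is the right tool for that); it only repeats one small pipeline and records each round, so that the first round can be compared with the later ones. The class name FirstCallDemo, the method timeRounds and the field sink are hypothetical names chosen for this illustration.

```java
import java.util.Arrays;
import java.util.List;

class FirstCallDemo {
    // Written on every round so the result of the pipeline is actually used.
    static volatile long sink;

    // Runs the same small pipeline `rounds` times and returns each round's
    // duration in nanoseconds. Naive timing: for illustration only.
    static long[] timeRounds(List<String> persons, int rounds) {
        long[] nanos = new long[rounds];
        for (int i = 0; i < rounds; i++) {
            long start = System.nanoTime();
            sink = persons.stream().mapToInt(String::length).sum();
            nanos[i] = System.nanoTime() - start;
        }
        return nanos;
    }

    public static void main(String[] args) {
        List<String> persons = Arrays.asList("AAA", "BBB", "CCC", "DDD");
        long[] nanos = timeRounds(persons, 5);
        for (int i = 0; i < nanos.length; i++) {
            System.out.println("round " + (i + 1) + ": " + nanos[i] + " ns");
        }
    }
}
```

On a typical JVM the first round should take far longer than the rest, although the exact numbers vary from run to run and from machine to machine.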

Let's write a JMH benchmark to properly warm up your code and test all the cases independently:

import java.util.concurrent.TimeUnit;
import java.util.*;
import java.util.stream.*;

import org.openjdk.jmh.annotations.*;

@Warmup(iterations = 5, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@Measurement(iterations = 10, time = 1000, timeUnit = TimeUnit.MILLISECONDS)
@BenchmarkMode(Mode.AverageTime)
@OutputTimeUnit(TimeUnit.MICROSECONDS)
@Fork(3)
@State(Scope.Benchmark)
public class StreamTest {
  List<String> persons;
  @Setup
  public void setup() {
    persons = new ArrayList<String>();
    persons.add("AAA");
    persons.add("BBB");
    persons.add("CCC");
    persons.add("DDD");
  }

  @Benchmark
  public void loop() {
    for(String person : persons)
      System.err.println(person);
  }

  @Benchmark
  public void stream() {
    persons.stream().forEach(System.err::println);
  }

  @Benchmark
  public void parallelStream() {
    persons.parallelStream().forEach(System.err::println);
  }
}

Here we have three tests: loop, stream and parallelStream. Note that I changed System.out to System.err, because System.out is normally used to output the JMH results. I will redirect the output of System.err to nul, so the result should depend less on my filesystem or console subsystem (which is especially slow on Windows).
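The redirect to nul happens outside the JVM, at the shell level. As a rough in-process approximation (not what was used for the measurements in this answer), System.err can be pointed at a stream that throws its input away. The class DiscardErrDemo and the helper CountingNullStream are hypothetical names for this sketch; the byte counter exists only so that the effect can be checked.

```java
import java.io.OutputStream;
import java.io.PrintStream;
import java.util.Arrays;
import java.util.List;

class DiscardErrDemo {
    // Drops every byte it receives and only keeps a count (hypothetical helper).
    static final class CountingNullStream extends OutputStream {
        long bytes;
        @Override public void write(int b) { bytes++; }
        @Override public void write(byte[] buf, int off, int len) { bytes += len; }
    }

    // Prints each element to System.err while it points at the discarding
    // stream, then restores the original stream. Returns the bytes dropped.
    static long printDiscarded(List<String> persons) {
        PrintStream original = System.err;
        CountingNullStream sink = new CountingNullStream();
        PrintStream discard = new PrintStream(sink, true);
        System.setErr(discard);
        try {
            persons.forEach(System.err::println);
        } finally {
            discard.flush();
            System.setErr(original);
        }
        return sink.bytes;
    }

    public static void main(String[] args) {
        long dropped = printDiscarded(Arrays.asList("AAA", "BBB", "CCC", "DDD"));
        System.out.println("bytes discarded: " + dropped);
    }
}
```

Even with such a sink, every println still goes through the PrintStream machinery (encoding, locking, flushing), so the printing cost does not disappear.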

So the results are (Core i7-4702MQ CPU @ 2.2GHz, 4 cores HT, Win7, Oracle JDK 1.8.0_40):

Benchmark                  Mode  Cnt   Score   Error  Units
StreamTest.loop            avgt   30  42.410 ± 1.833  us/op
StreamTest.parallelStream  avgt   30  76.440 ± 2.073  us/op
StreamTest.stream          avgt   30  42.820 ± 1.389  us/op

What we see is that stream and loop produce effectively the same result: the difference is within the measurement error and statistically insignificant. In general the Stream API is somewhat slower than a plain loop, but here the slowest part is the PrintStream. Even with output to nul, the IO subsystem is very slow compared to the other operations. So what we measured is not the speed of the Stream API or of the loop, but the speed of println.

Also note that the units are microseconds, so the stream version actually runs about 1000 times faster than in your test (roughly 43 us here versus 49 ms there).

Why is parallelStream much slower? Because you cannot parallelize writes to the same PrintStream: it is internally synchronized. So parallelStream did all the hard work of splitting the 4-element list into 4 sub-tasks, scheduling the jobs on different threads and synchronizing them properly, but it is all futile, because the slowest operation (println) cannot run in parallel: while one thread is printing, the others are waiting. In general it is useless to parallelize code which synchronizes on the same mutex (which is your case).
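The shared-mutex point can be illustrated with a small sketch that does not involve PrintStream at all. Each parallel task takes the same lock before doing its work, and a counter records how many threads are ever inside the locked section at once. The names SharedLockDemo, slowOperation and maxThreadsInsideLock are hypothetical, and the busy loop is only a stand-in for a slow call such as println.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.IntStream;

class SharedLockDemo {
    // Keeps the result of the busy loop observable.
    static volatile long sink;

    // Stand-in for a slow operation such as println (hypothetical busy work).
    static void slowOperation() {
        long acc = 0;
        for (int i = 0; i < 100_000; i++) {
            acc += i;
        }
        sink = acc;
    }

    // Runs `tasks` parallel tasks that all synchronize on one lock and returns
    // the highest number of threads seen inside the locked section at once.
    static int maxThreadsInsideLock(int tasks) {
        final Object lock = new Object();
        AtomicInteger inside = new AtomicInteger();
        AtomicInteger maxInside = new AtomicInteger();
        IntStream.range(0, tasks).parallel().forEach(i -> {
            synchronized (lock) {
                int now = inside.incrementAndGet();
                maxInside.accumulateAndGet(now, Math::max);
                slowOperation();
                inside.decrementAndGet();
            }
        });
        return maxInside.get();
    }

    public static void main(String[] args) {
        System.out.println("max threads inside lock: " + maxThreadsInsideLock(64));
    }
}
```

However many worker threads the parallel stream uses, the slow part runs on one thread at a time, so the splitting and scheduling overhead buys nothing.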

Tagir Valeev