
While toying with the new Java streams API, I noticed something strange about the performance of parallel streams. I used a simple program that reads the words from a text file and counts the words with length > 5 (the test file has 30,000 words):

    String contents = new String(Files.readAllBytes(Paths.get("text.txt")));
    List<String> words = Arrays.asList(contents.split("[\\P{L}]+"));
    long startTime;
    for (int i = 0; i < 100; i++) {
        // Time the parallel stream.
        startTime = System.nanoTime();
        words.parallelStream().filter(w -> w.length() > 5).count();
        System.out.println("Time elapsed [PAR]: " + (System.nanoTime() - startTime));
        // Time the sequential stream.
        startTime = System.nanoTime();
        words.stream().filter(w -> w.length() > 5).count();
        System.out.println("Time elapsed [SEQ]: " + (System.nanoTime() - startTime));
        System.out.println("------------------");
    }

This generates the following output on my machine (only the first 5 and the last 5 loop iterations are shown):

    Time elapsed [PAR]: 114185196
    Time elapsed [SEQ]: 3222664
    ------------------
    Time elapsed [PAR]: 569611
    Time elapsed [SEQ]: 797113
    ------------------
    Time elapsed [PAR]: 678231
    Time elapsed [SEQ]: 414807
    ------------------
    Time elapsed [PAR]: 755633
    Time elapsed [SEQ]: 679085
    ------------------
    Time elapsed [PAR]: 755633
    Time elapsed [SEQ]: 393425
    ------------------
    ...
    Time elapsed [PAR]: 90232
    Time elapsed [SEQ]: 163785
    ------------------
    Time elapsed [PAR]: 80396
    Time elapsed [SEQ]: 154805
    ------------------
    Time elapsed [PAR]: 83817
    Time elapsed [SEQ]: 154377
    ------------------
    Time elapsed [PAR]: 81679
    Time elapsed [SEQ]: 186449
    ------------------
    Time elapsed [PAR]: 68849
    Time elapsed [SEQ]: 154804
    ------------------

Why is the first iteration around 100 times slower than the rest? Why is the parallel stream slower than the sequential one in the first iterations, yet about twice as fast in the last ones? Why do both the sequential and the parallel streams become faster over time? Is this related to loop optimization?

Later edit: At Luiggi's suggestion, I implemented the benchmark using JUnitBenchmarks:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;

    import org.junit.Before;
    import org.junit.Test;

    import com.carrotsearch.junitbenchmarks.AbstractBenchmark;
    import com.carrotsearch.junitbenchmarks.BenchmarkOptions;

    // JUnitBenchmarks measures tests through a JUnit rule;
    // extending AbstractBenchmark registers that rule.
    public class StreamBenchmarkTest extends AbstractBenchmark {

        List<String> words = null;

        @Before
        public void setup() throws IOException {
            String contents = new String(Files.readAllBytes(Paths.get("text.txt")));
            words = Arrays.asList(contents.split("[\\P{L}]+"));
        }

        @BenchmarkOptions(benchmarkRounds = 100)
        @Test
        public void parallelTest() {
            words.parallelStream().filter(w -> w.length() > 5).count();
        }

        @BenchmarkOptions(benchmarkRounds = 100)
        @Test
        public void sequentialTest() {
            words.stream().filter(w -> w.length() > 5).count();
        }
    }

I also bumped up the number of words in the test file to 300,000. The new results are:

    Benchmark.sequentialTest: [measured 100 out of 105 rounds, threads: 1 (sequential)]
    round: 0.08 [+- 0.04], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 62, GC.time: 1.53, time.total: 8.65, time.warmup: 0.81, time.bench: 7.85

    Benchmark.parallelTest: [measured 100 out of 105 rounds, threads: 1 (sequential)]
    round: 0.06 [+- 0.02], round.block: 0.00 [+- 0.00], round.gc: 0.00 [+- 0.00], GC.calls: 32, GC.time: 0.79, time.total: 6.82, time.warmup: 0.39, time.bench: 6.43

So it seems that the initial results were caused by an incorrect microbenchmark setup...
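
For comparison, simply adding a warm-up phase to the original hand-rolled loop also makes the numbers stabilize (see the comments below). Here is a minimal sketch of that variant; the class name and iteration counts are only illustrative, and the same text.txt input is assumed:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;

    public class WarmupBenchmark {
        public static void main(String[] args) throws IOException {
            String contents = new String(Files.readAllBytes(Paths.get("text.txt")));
            List<String> words = Arrays.asList(contents.split("[\\P{L}]+"));

            // Warm-up: run both pipelines often enough for the JIT to
            // compile the hot stream code before any timing starts.
            for (int i = 0; i < 10_000; i++) {
                words.stream().filter(w -> w.length() > 5).count();
                words.parallelStream().filter(w -> w.length() > 5).count();
            }

            // Measure only after the warm-up.
            for (int i = 0; i < 5; i++) {
                long start = System.nanoTime();
                words.parallelStream().filter(w -> w.length() > 5).count();
                System.out.println("Time elapsed [PAR]: " + (System.nanoTime() - start));

                start = System.nanoTime();
                words.stream().filter(w -> w.length() > 5).count();
                System.out.println("Time elapsed [SEQ]: " + (System.nanoTime() - start));
            }
        }
    }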

  • Related: [How do I write a correct micro-benchmark in Java?](http://stackoverflow.com/q/504103/1065197) – Luiggi Mendoza Apr 10 '14 at 21:31
  • Thanks for the really helpful link, Luiggi. That explains the slowness of the first run. After implementing a warm-up, the performance of the first iteration becomes similar to that of the subsequent iterations. – Alexandru Cojocaru Apr 11 '14 at 06:41
  • Note that if you invert the order of your tests, i.e. execute sequential first and then parallel, you will get completely different results. It would be better to use a benchmarking framework such as [JUnitBenchmarks](http://labs.carrotsearch.com/junit-benchmarks.html) – Luiggi Mendoza Apr 11 '14 at 14:18
  • Added the JUnitBenchmarks results. Thanks for the suggestion. Most likely the problem was caused by my flawed microbenchmarking code. – Alexandru Cojocaru Apr 11 '14 at 19:10

1 Answer


The HotSpot JVM starts executing the program in interpreted mode and compiles frequently used parts to native code after some analysis. The initial iterations of loops are generally slow because of this.
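
One way to observe this (mentioned in the comments below) is to run the program with the -XX:+PrintCompilation diagnostic flag, which makes the JVM print a line for each method as it compiles it. A minimal sketch; the class name and iteration count are only illustrative:

    // Run with: java -XX:+PrintCompilation CompilationDemo
    // The JVM logs each method it JIT-compiles; the stream and lambda
    // plumbing shows up during the first loop iterations.
    import java.util.Arrays;
    import java.util.List;

    public class CompilationDemo {
        public static void main(String[] args) {
            List<String> words = Arrays.asList(
                    "alpha", "bravo", "charlie", "deltaecho", "foxtrotgolf");
            // Enough calls to cross HotSpot's compile threshold
            // (roughly 10,000 invocations for the server compiler).
            for (int i = 0; i < 20_000; i++) {
                words.stream().filter(w -> w.length() > 5).count();
            }
        }
    }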

Joni
  • Thanks for explaining why the iterations become faster, Joni. It's interesting to note that the first iteration is always 5-6 times slower than the second, while the following ones have roughly similar performance. If the compilation to native code happens after some iterations, why is the first one so much slower? Also, the question of why the parallel stream is initially 2 times slower but ends up 2 times faster than the sequential one remains. – Alexandru Cojocaru Apr 11 '14 at 06:46
  • The compilation is triggered after a method has been called about 10,000 times, so during the first iteration of your loop the JVM detects a hotspot in the library code you use and compiles it, making the subsequent runs faster. You can enable diagnostic flags so that the JVM prints out what it compiles and when. Explaining the parallel stream behavior requires some investigation. – Joni Apr 11 '14 at 07:13
  • Ok, this explains the output that I get with the -XX:+PrintCompilation flag set. But I would really like to know why the parallel stream is slower... I thought that, in general, we should expect the parallel stream to be more efficient. – Alexandru Cojocaru Apr 11 '14 at 07:51
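
For completeness: the micro-benchmarking question linked in the comments above is typically answered with JMH, which takes care of warm-up, forking, and dead-code elimination. A minimal sketch of the same comparison, assuming JMH is on the classpath; the class name is illustrative:

    import java.io.IOException;
    import java.nio.file.Files;
    import java.nio.file.Paths;
    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.TimeUnit;

    import org.openjdk.jmh.annotations.Benchmark;
    import org.openjdk.jmh.annotations.BenchmarkMode;
    import org.openjdk.jmh.annotations.Mode;
    import org.openjdk.jmh.annotations.OutputTimeUnit;
    import org.openjdk.jmh.annotations.Scope;
    import org.openjdk.jmh.annotations.Setup;
    import org.openjdk.jmh.annotations.State;

    @State(Scope.Benchmark)
    @BenchmarkMode(Mode.AverageTime)
    @OutputTimeUnit(TimeUnit.MICROSECONDS)
    public class StreamComparisonBenchmark {

        List<String> words;

        @Setup
        public void setup() throws IOException {
            String contents = new String(Files.readAllBytes(Paths.get("text.txt")));
            words = Arrays.asList(contents.split("[\\P{L}]+"));
        }

        // Returning the count keeps the JIT from eliminating the work.
        @Benchmark
        public long sequential() {
            return words.stream().filter(w -> w.length() > 5).count();
        }

        @Benchmark
        public long parallel() {
            return words.parallelStream().filter(w -> w.length() > 5).count();
        }
    }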