Parallel stream processing vs. thread pool processing vs. sequential processing

Question

I was evaluating which of the following code snippets performs best in Java 8.

Snippet 1 (Processing in the main thread):

public long doSequence() {
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    ds.forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    });
    System.out.println("Sequence");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 2 (Processing in parallel threads):

public long doParallel() {
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    ds.parallel().forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    });
    System.out.println("Parallel");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Snippet 3 (Processing in parallel threads from a thread pool):

public long doThreadPoolParallel() throws InterruptedException, ExecutionException {
    ForkJoinPool customThreadPool = new ForkJoinPool(4);
    DoubleStream ds = IntStream.range(0, 100000).asDoubleStream();
    long startTime = System.currentTimeMillis();
    final AtomicLong al = new AtomicLong();
    customThreadPool.submit(() -> ds.parallel().forEach((num) -> {
        long n1 = new Double (Math.pow(num, 3)).longValue();
        long n2 = new Double (Math.pow(num, 2)).longValue();
        al.addAndGet(n1 + n2);
    })).get();
    System.out.println("Thread Pool");
    System.out.println(al.get());
    long endTime = System.currentTimeMillis();
    return (endTime - startTime);
}

Output is here:

Parallel
6553089257123798384
34 <-- 34 milliseconds

Thread Pool
6553089257123798384
23 <-- 23 milliseconds

Sequence
6553089257123798384
12 <-- 12 milliseconds!

What I expected:

1) Processing with the thread pool should take the least time, but it does not. (Note that I have not included the thread pool creation time, so it should be fast.)

2) I never expected the sequential code to be the fastest. What could be the reason for that?

I am using a quad-core processor.

I would appreciate any help explaining these results!

Have you read up on proper microbenchmarking in Java? https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java -- how are you actually invoking your benchmarks? Are you warming up the JVM? — Erwin Bolwidt, May 04 '18 at 05:18
@ErwinBolwidt I have not followed all the points mentioned for microbenchmarking, but I did warm up the JVM before printing the above numbers. Sequential processing is *always* faster than its counterparts, which is really baffling! — Cyril Cherian, May 04 '18 at 05:29
It's likely your threads spend too much time on Atomic operations. — Bogdan Lukiyanchuk, May 04 '18 at 05:35
First of all, you should use `System.nanoTime()` to measure *elapsed time*. Further, if you claim to test stream processing, you should do stream processing instead of disguised loop code, i.e. `IntStream.range(0, 100000).parallel().mapToLong(num -> (long)Math.pow(num, 3) + (long)Math.pow(num, 2)).sum()`. Then, try with larger ranges to see how it scales. This allows you to identify the fixed overhead fraction. Note, by the way, how the cast to `long` becomes simpler when not obfuscating it via `new Double(…).longValue()`… — Holger, May 04 '18 at 06:16
As pointed out by @Erwin, I should have used a proper benchmarking tool. I used JMH and got the expected results. Thanks, everyone! — Cyril Cherian, May 04 '18 at 06:39
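
The stream-based rewrite suggested in the comments can be sketched roughly as follows. This is only an illustration: the class name `StreamSumSketch` and the method `sumOfPowers` are made up for this example, and `mapToLong` is used because the lambda yields a `long`, which a plain `IntStream.map` would not accept. The single timing in `main` is not a benchmark; it only shows `System.nanoTime()` in place of `System.currentTimeMillis()`.

```java
import java.util.stream.IntStream;

// Hypothetical sketch, not the code from the question.
class StreamSumSketch {

    // Sums n^3 + n^2 over [0, upperExclusive) with a stream reduction,
    // so no shared counter is updated from multiple threads.
    static long sumOfPowers(int upperExclusive, boolean parallel) {
        IntStream range = IntStream.range(0, upperExclusive);
        if (parallel) {
            range = range.parallel();
        }
        return range
                .mapToLong(num -> (long) Math.pow(num, 3) + (long) Math.pow(num, 2))
                .sum(); // long overflow wraps, as it does with AtomicLong.addAndGet
    }

    public static void main(String[] args) {
        long start = System.nanoTime();
        long result = sumOfPowers(100_000, true);
        long elapsedNanos = System.nanoTime() - start;
        System.out.println(result + " in " + elapsedNanos / 1_000_000.0 + " ms");
    }
}
```

Because the reduction is done by the stream itself, the sequential and parallel variants should produce the same total as the `AtomicLong` versions in the question, without a contended shared variable.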

score 3 · Accepted Answer · answered May 04 '18 at 05:59

Your comparison is flawed, most likely because the JVM was not warmed up. When I simply repeat the executions within the same run, I get different results:

System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());
System.out.println("-------");
System.out.println(doParallel());
System.out.println(doThreadPoolParallel());
System.out.println(doSequence());

Results:

Parallel
6553089257123798384
65
Thread Pool
6553089257123798384
13
Sequence
6553089257123798384
14
-------
Parallel
6553089257123798384
9
Thread Pool
6553089257123798384
4
Sequence
6553089257123798384
8
-------
Parallel
6553089257123798384
8
Thread Pool
6553089257123798384
3
Sequence
6553089257123798384
8

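As pointed out by @Erwin in the comments, please check the answers to "How do I write a correct micro-benchmark in Java?" (https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java), rule 1 in this case, for ideas on how to do this benchmarking correctly.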
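
To make the warm-up effect visible without extra dependencies, a hand-rolled harness along these lines may help. It is a rough sketch, not a substitute for a benchmarking tool such as JMH: the class `WarmupTimingSketch`, the method `bestNanos`, and the run counts are assumptions chosen for illustration, and the numbers it prints will still be noisy.

```java
import java.util.function.LongSupplier;
import java.util.stream.IntStream;

// Hypothetical harness: runs a task untimed first, then reports the best timed run.
class WarmupTimingSketch {

    static long bestNanos(LongSupplier task, int warmupRuns, int measuredRuns) {
        long sink = 0; // accumulate results so the JIT cannot discard the work
        for (int i = 0; i < warmupRuns; i++) {
            sink += task.getAsLong();
        }
        long best = Long.MAX_VALUE;
        for (int i = 0; i < measuredRuns; i++) {
            long start = System.nanoTime();
            sink += task.getAsLong();
            best = Math.min(best, System.nanoTime() - start);
        }
        if (sink == 42) {
            System.out.println("unlikely sink value"); // keeps 'sink' observably used
        }
        return best;
    }

    public static void main(String[] args) {
        LongSupplier sequential = () -> IntStream.range(0, 100_000)
                .mapToLong(n -> (long) Math.pow(n, 3) + (long) Math.pow(n, 2)).sum();
        LongSupplier parallel = () -> IntStream.range(0, 100_000).parallel()
                .mapToLong(n -> (long) Math.pow(n, 3) + (long) Math.pow(n, 2)).sum();
        System.out.println("sequential best ns: " + bestNanos(sequential, 20, 10));
        System.out.println("parallel best ns:   " + bestNanos(parallel, 20, 10));
    }
}
```

Taking the best of several measured runs after an untimed warm-up phase reduces the influence of JIT compilation and one-off pool start-up costs, which is what the first, much slower "Parallel" result above appears to include.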

The default parallelism of a parallel stream isn't necessarily the same as that of a fork-join pool with as many threads as there are cores: parallel streams run on the common fork-join pool, whose default parallelism is one less than the number of available processors. Even so, the difference between the results is negligible when I switch from your custom pool to the common fork-join pool.
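
The parallelism levels involved can be inspected directly. The following is a small sketch; the class `ParallelismSketch` and the method `report` are hypothetical names, and the common-pool value depends on the machine and on whether the `java.util.concurrent.ForkJoinPool.common.parallelism` system property has been set.

```java
import java.util.concurrent.ForkJoinPool;

// Hypothetical helper that prints the parallelism levels being compared.
class ParallelismSketch {

    // Builds a one-line summary for the machine, the common pool, and a custom pool.
    static String report(int customThreads) {
        int cores = Runtime.getRuntime().availableProcessors();
        int common = ForkJoinPool.commonPool().getParallelism();
        ForkJoinPool custom = new ForkJoinPool(customThreads);
        try {
            return "cores=" + cores
                    + ", commonPool=" + common
                    + ", customPool=" + custom.getParallelism();
        } finally {
            custom.shutdown(); // release the worker threads of the throwaway pool
        }
    }

    public static void main(String[] args) {
        System.out.println(report(4));
    }
}
```

Note that running a parallel stream inside a task submitted to a custom `ForkJoinPool`, as Snippet 3 does, relies on behavior that the stream API does not document, so it is best treated as an implementation detail.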

Well, this is very much what I wanted to see: the thread pool should perform better, and your results reflect that. I will run my snippets on a quiet machine and benchmark them following https://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java. Maybe that is the difference. — Cyril Cherian, May 04 '18 at 06:07

score 1 · Answer 2 · answered May 04 '18 at 05:52

AtomicLong.addAndGet requires synchronization between threads: every thread has to see the result of the previous addAndGet, which is why you can count on the total being correct.

Although this is not the traditional synchronization via the synchronized keyword, it still has an overhead. In JDK 7, addAndGet was implemented as a compare-and-swap retry loop in Java code. In JDK 8, it was turned into an intrinsic, which HotSpot implements with a LOCK XADD instruction on the Intel platform.

It requires cache synchronization between CPUs, which has an overhead. It may even require data to be flushed to and read from main memory, which is extremely slow compared to code that doesn't need to do that.

It's quite possible, since this synchronization overhead happens for every iteration in your test, that the overhead is larger than any performance gains made from parallelizing.
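
One way to test this hypothesis is to keep the `forEach` structure but swap the shared `AtomicLong` for a `java.util.concurrent.atomic.LongAdder`, which is designed to reduce contention by spreading updates across internal cells. The sketch below is an assumption-laden illustration (the class `LongAdderSketch` and the method `sumWithLongAdder` are invented here), and whether it is actually faster on a given machine would need to be measured.

```java
import java.util.concurrent.atomic.LongAdder;
import java.util.stream.IntStream;

// Hypothetical variant of the question's parallel snippet using LongAdder.
class LongAdderSketch {

    static long sumWithLongAdder(int upperExclusive) {
        LongAdder adder = new LongAdder();
        IntStream.range(0, upperExclusive).parallel().forEach(num -> {
            long n1 = (long) Math.pow(num, 3);
            long n2 = (long) Math.pow(num, 2);
            adder.add(n1 + n2); // threads mostly update separate cells
        });
        return adder.sum(); // combine the cells once all updates are done
    }

    public static void main(String[] args) {
        System.out.println(sumWithLongAdder(100_000));
    }
}
```

If the per-iteration atomic update is the bottleneck, this variant should narrow the gap to the sequential version; a plain stream reduction such as `sum()` avoids the shared accumulator altogether.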
