648

With Java 8 and lambdas it's easy to iterate over collections as streams, and just as easy to use a parallel stream. Two examples from the docs, the second one using parallelStream:

myShapesCollection.stream()
    .filter(e -> e.getColor() == Color.RED)
    .forEach(e -> System.out.println(e.getName()));

myShapesCollection.parallelStream() // <-- This one uses parallel
    .filter(e -> e.getColor() == Color.RED)
    .forEach(e -> System.out.println(e.getName()));

As long as I don't care about the order, would it always be beneficial to use the parallel stream? One would think it would be faster to divide the work across more cores.

Are there other considerations? When should parallel stream be used and when should the non-parallel be used?

(This question is asked to trigger a discussion about how and when to use parallel streams, not because I think always using them is a good idea.)

Matsemann

6 Answers

884

A parallel stream has a much higher overhead compared to a sequential one. Coordinating the threads takes a significant amount of time. I would use sequential streams by default and only consider parallel ones if

  • I have a massive amount of items to process (or the processing of each item takes time and is parallelizable)

  • I have a performance problem in the first place

  • I don't already run the process in a multi-thread environment (for example: in a web container, if I already have many requests to process in parallel, adding an additional layer of parallelism inside each request could have more negative than positive effects)

In your example, the performance will in any case be driven by the synchronized access to System.out.println(), so making this process parallel will have no effect, or even a negative one.

Moreover, remember that parallel streams don't magically solve all the synchronization problems. If a shared resource is used by the predicates and functions used in the process, you'll have to make sure that everything is thread-safe. In particular, side effects are things you really have to worry about if you go parallel.
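To illustrate that last point, here is a minimal sketch of my own (not part of the original answer) contrasting a side-effecting parallel forEach with a collector; the class name and sizes are arbitrary:

    import java.util.ArrayList;
    import java.util.List;
    import java.util.stream.Collectors;
    import java.util.stream.IntStream;

    public class SideEffectSketch {
        public static void main(String[] args) {
            // Unsafe: mutating a shared ArrayList from a parallel forEach.
            // ArrayList is not thread-safe, so elements can be lost or an exception thrown.
            List<Integer> unsafe = new ArrayList<>();
            IntStream.range(0, 100_000).parallel().forEach(unsafe::add);
            System.out.println("unsafe size: " + unsafe.size()); // often less than 100000

            // Safe: let the stream accumulate the result with a collector instead.
            List<Integer> safe = IntStream.range(0, 100_000)
                    .parallel()
                    .boxed()
                    .collect(Collectors.toList());
            System.out.println("safe size: " + safe.size()); // always 100000
        }
    }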

In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.
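As a very rough way to do that measuring, here is a crude sketch with arbitrary sizes (for real decisions a proper harness such as JMH is preferable, since JIT warm-up easily distorts one-off timings):

    import java.util.stream.LongStream;

    public class CrudeTimingSketch {
        public static void main(String[] args) {
            long n = 100_000_000L; // arbitrary size, adjust for your machine

            long t0 = System.nanoTime();
            long seq = LongStream.rangeClosed(1, n).sum();
            long t1 = System.nanoTime();
            long par = LongStream.rangeClosed(1, n).parallel().sum();
            long t2 = System.nanoTime();

            System.out.println("sequential: " + (t1 - t0) / 1_000_000 + " ms, result " + seq);
            System.out.println("parallel:   " + (t2 - t1) / 1_000_000 + " ms, result " + par);
        }
    }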

JB Nizet
  • 29
    Good answer. I would add that if you have a massive amount of items to process, that only increases the thread coordination issues; it's only when the processing of each item takes time and is parallelizable that parallelization might be useful. – Warren Dew Apr 08 '14 at 16:12
  • 23
    @WarrenDew I disagree. The Fork/Join system will simply split the N items into, for example, 4 parts, and process these 4 parts sequentially. The 4 results will then be reduced. If massive really is massive, even for fast unit processing, parallelization can be effective. But as always, you have to measure. – JB Nizet Apr 08 '14 at 16:49
  • I have a collection of objects that implement `Runnable`, on which I call `start()` to use them as `Thread`s. Is it OK to change that to using Java 8 streams with a parallelized `.forEach()`? Then I'd be able to strip the thread code out of the class. But are there any downsides? – ycomp Jun 05 '16 at 18:56
  • 1
    @JBNizet If the 4 parts are processed sequentially, then there is no difference between processing them in parallel or sequentially, no? Please clarify – Harshana Jul 27 '16 at 16:31
  • 5
    @Harshana he obviously means that the elements of each of the 4 parts will be processed sequentially. However, the parts themselves may be processed simultaneously. In other words, if you have several CPU cores available, each part can run on its own core independently of the other parts, while processing its own elements sequentially. (NOTE: I don't know if this is how parallel Java streams work, I'm just trying to clarify what JBNizet meant.) – tomorrow Nov 17 '16 at 15:37
  • How much is "massive"? As always, measurement is key, but what should one generally classify as "massive"? – qwelyt Oct 10 '18 at 07:31
  • @JBNizet I think the last point should have been made bold: `In any case, measure, don't guess! Only a measurement will tell you if the parallelism is worth it or not.` A parallel read on a list will push Java to spawn threads to read the list. Parallelism becomes helpful when working with processes that are not already spawning threads, for example running a task in a Spring-managed bean, where Spring keeps one thread for processing the task inside the bean. Every processing scenario is different, and without tweaking and measuring what does or does not improve processing, the analysis will be incomplete. – Acewin Jan 30 '19 at 00:46
  • @JBNizet flawless answer! It's a checklist to visit before adding parallel streams in the code. – Gaurav Jun 30 '19 at 10:15
  • If there are very few values that match the `filter` predicate, it could help, even with the `println`. – Solomon Ucko May 20 '20 at 01:41
  • Thanks, in my case (handling a 16k- or 60k-entry list) it is slower :-) – Gunnar Apr 16 '21 at 11:36
298

The Stream API was designed to make it easy to write computations in a way that was abstracted away from how they would be executed, making switching between sequential and parallel easy.

However, just because it's easy doesn't mean it's always a good idea, and in fact, it is a bad idea to just drop .parallel() all over the place simply because you can.

First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available. A parallel execution will always involve more work than a sequential one, because in addition to solving the problem, it also has to perform dispatching and coordinating of sub-tasks. The hope is that you'll be able to get to the answer faster by breaking up the work across multiple processors; whether this actually happens depends on a lot of things, including the size of your data set, how much computation you are doing on each element, the nature of the computation (specifically, does the processing of one element interact with processing of others?), the number of processors available, and the number of other tasks competing for those processors.

Further, note that parallelism also often exposes nondeterminism in the computation that is often hidden by sequential implementations; sometimes this doesn't matter, or can be mitigated by constraining the operations involved (i.e., reduction operators must be stateless and associative.)
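One small, concrete way this shows up (a sketch of my own, not from the answer): forEach on a parallel stream does not preserve encounter order, while forEachOrdered does.

    import java.util.Arrays;
    import java.util.List;

    public class OrderSketch {
        public static void main(String[] args) {
            List<String> letters = Arrays.asList("a", "b", "c", "d", "e", "f", "g", "h");

            // Order depends on which thread gets to print first, e.g. "efghabcd":
            letters.parallelStream().forEach(System.out::print);
            System.out.println();

            // Encounter order is restored, always prints "abcdefgh" (at some cost to parallelism):
            letters.parallelStream().forEachOrdered(System.out::print);
            System.out.println();
        }
    }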

In reality, sometimes parallelism will speed up your computation, sometimes it will not, and sometimes it will even slow it down. It is best to develop first using sequential execution and then apply parallelism where

(A) you know that there's actually benefit to increased performance and

(B) that it will actually deliver increased performance.

(A) is a business problem, not a technical one. If you are a performance expert, you'll usually be able to look at the code and determine (B), but the smart path is to measure. (And, don't even bother until you're convinced of (A); if the code is fast enough, better to apply your brain cycles elsewhere.)

The simplest performance model for parallelism is the "NQ" model, where N is the number of elements, and Q is the computation per element. In general, you need the product NQ to exceed some threshold before you start getting a performance benefit. For a low-Q problem like "add up numbers from 1 to N", you will generally see a breakeven between N=1000 and N=10000. With higher-Q problems, you'll see breakevens at lower thresholds.
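A sketch of what that threshold looks like in practice (my own illustration, with an arbitrary stand-in for "expensive" per-element work; actual numbers depend heavily on hardware and JIT warm-up):

    import java.util.stream.IntStream;

    public class NQSketch {
        // arbitrary stand-in for an expensive per-element computation (high Q)
        static double slowWork(int i) {
            double x = i;
            for (int k = 0; k < 10_000; k++) {
                x = Math.sin(x) + Math.cos(x);
            }
            return x;
        }

        static long timeMs(Runnable r) {
            long t0 = System.nanoTime();
            r.run();
            return (System.nanoTime() - t0) / 1_000_000;
        }

        public static void main(String[] args) {
            int n = 10_000;
            // Low Q: trivial work per element, so coordination overhead can dominate.
            System.out.println("low-Q  sequential: " + timeMs(() -> IntStream.range(0, n).map(i -> i + 1).sum()) + " ms");
            System.out.println("low-Q  parallel:   " + timeMs(() -> IntStream.range(0, n).parallel().map(i -> i + 1).sum()) + " ms");
            // High Q: expensive work per element, so splitting across cores has a real chance to pay off.
            System.out.println("high-Q sequential: " + timeMs(() -> IntStream.range(0, n).mapToDouble(NQSketch::slowWork).sum()) + " ms");
            System.out.println("high-Q parallel:   " + timeMs(() -> IntStream.range(0, n).parallel().mapToDouble(NQSketch::slowWork).sum()) + " ms");
        }
    }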

But the reality is quite complicated. So until you achieve experthood, first identify when sequential processing is actually costing you something, and then measure if parallelism will help.

Brian Goetz
  • 21
    This post gives further details about the NQ model: http://gee.cs.oswego.edu/dl/html/StreamParallelGuidance.html – Pino Apr 01 '15 at 08:25
  • I think you need to elaborate on your claim about nondeterminism - AFAIK it is strictly impossible for a deterministic problem to become nondeterministic unless the problem itself is changed - this also applies to streams; all they do is wrap threads around segments of iterable collections, so unless the elements themselves interact with each other in a write manner, the whole stream cannot be nondeterministic --- and if your collection itself is nondeterministic from the beginning, even the best for/while loop won't change that fact. – specializt Oct 05 '15 at 11:35
  • 1
    @specializt: I'm not sure I understand your question -- are you looking for someone to explain why multiple parallel threads make computation nondeterminstic? – Doradus Mar 09 '16 at 21:53
  • ... no? Parallel threads don't make **anything** nondeterministic by themselves. Appropriate **algorithms** may do that, but that's about it - the technology itself has nothing to do with the outcome of its usage / application, that's complete nonsense - it's easy to write parallel code which does not change determinism, at all – specializt Mar 24 '16 at 11:24
  • 4
    @specializt: switching a stream from sequential to parallel *does* change the algorithm (in most cases). The determinism mentioned here is regarding properties your (arbitrary) operators *might* rely on (the Stream implementation can’t know that), but of course *shouldn’t* rely on. That’s what that section of this answer tried to say. If you care about the rules, you can have a deterministic result, just like you say, (otherwise parallel streams were quite useless), but there’s also the possibility of intentionally allowed non-determinism, like when using `findAny` instead of `findFirst`… – Holger Apr 04 '16 at 17:07
  • 6
    "First, note that parallelism offers no benefits other than the possibility of faster execution when more cores are available" -- or if you're applying an action that involves IO (e.g. `myListOfURLs.stream().map((url) -> downloadPage(url))...`). – Jules Jul 30 '16 at 09:26
  • @BrianGoetz, since you're the language designer, you must heed this: **Always use parallelism whenever the algorithm is semantically so.** Because even while sequential is faster when NQ-below-10k, the difference would be so small it doesn't matter either way. Yet if NQ is above and way above 10k, parallelism is the only way. – Pacerier Jun 18 '17 at 06:41
  • Indeed, another plus point to this is semantic correctness. One should mark parallelizable code as parallelizable and **leave it to the compiler** to gauge if it should actually do it parallelly or not. This is the exact same line of argument of choosing `if` vs `switch`: simply use whichever is semantically correct and leave it to the compiler to decide how best to **implement** it. – Pacerier Jun 18 '17 at 06:41
  • 7
    @Pacerier That's a nice theory, but sadly naive (see the 30-year history of attempts to build auto-parallelizing compilers for a start). Since it is not practical to guess right enough of the time to not annoy the user when we inevitably get it wrong, the responsible thing to do was just to let the user say what they want. For most situations, the default (sequential) is right, and more predictable. – Brian Goetz Jun 22 '17 at 08:57
  • 3
    @Jules: Never use parallel streams for IO. They are solely meant for CPU intensive operations. Parallel streams use `ForkJoinPool.commonPool()` and you don't want blocking tasks to go there. – R2C2 Apr 22 '18 at 13:37
  • 1
    In the NQ model you wrote: "Q is the computation per element". What does that mean? Is it the time taken to compute one element? How do I calculate Q? – Ashish Aug 01 '18 at 14:21
85

I watched one of Brian Goetz's presentations (he is the Java Language Architect & specification lead for Lambda Expressions). He explains in detail the following 4 points to consider before going for parallelization:

Splitting / decomposition costs
– Sometimes splitting is more expensive than just doing the work!
Task dispatch / management costs
– Can do a lot of work in the time it takes to hand work to another thread.
Result combination costs
– Sometimes combination involves copying lots of data. For example, adding numbers is cheap whereas merging sets is expensive.
Locality
– The elephant in the room. This is an important point that everyone may miss. You should consider cache misses: if a CPU waits for data because of cache misses, then you won't gain anything by parallelization. That's why array-based sources parallelize the best, as the next indices (near the current index) are cached and there is less chance that the CPU will experience a cache miss.
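To make the locality point concrete, here is a rough sketch of my own comparing stream sources (sizes are arbitrary; a real comparison would need a benchmark harness):

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.stream.IntStream;

    public class LocalitySketch {
        static long timeMs(Runnable r) {
            long t0 = System.nanoTime();
            r.run();
            return (System.nanoTime() - t0) / 1_000_000;
        }

        public static void main(String[] args) {
            int n = 5_000_000;
            int[] array = IntStream.range(0, n).toArray();
            List<Integer> arrayList = new ArrayList<>();
            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < n; i++) { arrayList.add(i); linkedList.add(i); }

            // Primitive array: contiguous data, cheap to split -> the best parallel source.
            System.out.println("int[]      parallel: " + timeMs(() -> IntStream.of(array).parallel().map(i -> i * 2).sum()) + " ms");
            // ArrayList: contiguous references, still splits cheaply by index.
            System.out.println("ArrayList  parallel: " + timeMs(() -> arrayList.parallelStream().mapToInt(i -> i * 2).sum()) + " ms");
            // LinkedList: scattered nodes, expensive to split -> the worst parallel source.
            System.out.println("LinkedList parallel: " + timeMs(() -> linkedList.parallelStream().mapToInt(i -> i * 2).sum()) + " ms");
        }
    }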

He also mentions a relatively simple formula to estimate the chance of a parallel speedup.

NQ Model:

N x Q > 10000

where,
N = number of data items
Q = amount of work per item

Ram Patra
4

Other answers have already covered profiling to avoid premature optimization and overhead cost in parallel processing. This answer explains the ideal choice of data structures for parallel streaming.

As a rule, performance gains from parallelism are best on streams over ArrayList, HashMap, HashSet, and ConcurrentHashMap instances; arrays; int ranges; and long ranges. What these data structures have in common is that they can all be accurately and cheaply split into subranges of any desired sizes, which makes it easy to divide work among parallel threads. The abstraction used by the streams library to perform this task is the spliterator, which is returned by the spliterator method on Stream and Iterable.

Another important factor that all of these data structures have in common is that they provide good-to-excellent locality of reference when processed sequentially: sequential element references are stored together in memory. The objects referred to by those references may not be close to one another in memory, which reduces locality-of-reference. Locality-of-reference turns out to be critically important for parallelizing bulk operations: without it, threads spend much of their time idle, waiting for data to be transferred from memory into the processor’s cache. The data structures with the best locality of reference are primitive arrays because the data itself is stored contiguously in memory.

Source: Item #48 Use Caution When Making Streams Parallel, Effective Java 3e by Joshua Bloch
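As a small illustration of that splitting behaviour (my own sketch, not from the book), you can ask the spliterators directly how they divide the work:

    import java.util.ArrayList;
    import java.util.LinkedList;
    import java.util.List;
    import java.util.Spliterator;

    public class SpliteratorSketch {
        public static void main(String[] args) {
            List<Integer> arrayList = new ArrayList<>();
            List<Integer> linkedList = new LinkedList<>();
            for (int i = 0; i < 1_000_000; i++) { arrayList.add(i); linkedList.add(i); }

            // ArrayList knows its size exactly and splits in O(1) by index arithmetic,
            // typically handing off half of the remaining elements per split.
            Spliterator<Integer> a = arrayList.spliterator();
            Spliterator<Integer> aPrefix = a.trySplit();
            System.out.println("ArrayList split:  " + aPrefix.estimateSize() + " + " + a.estimateSize());

            // LinkedList also provides a spliterator, but splitting has to walk the nodes
            // (typically copying a batch of them into an array), which is far more expensive
            // and leaves the threads working on data with poor memory locality.
            Spliterator<Integer> l = linkedList.spliterator();
            Spliterator<Integer> lPrefix = l.trySplit();
            System.out.println("LinkedList split: " + lPrefix.estimateSize() + " + " + l.estimateSize());
        }
    }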

ruhong
4

Never parallelize an infinite stream with a limit. Here is what happens:

    import java.util.stream.IntStream;

    public class InfiniteTest {
        public static void main(String[] args) {
            // let's count to 1 in parallel
            System.out.println(
                IntStream.iterate(0, i -> i + 1)
                    .parallel()
                    .skip(1)
                    .findFirst()
                    .getAsInt());
        }
    }

Result

    Exception in thread "main" java.lang.OutOfMemoryError
        at ...
        at java.base/java.util.stream.IntPipeline.findFirst(IntPipeline.java:528)
        at InfiniteTest.main(InfiniteTest.java:24)
    Caused by: java.lang.OutOfMemoryError: Java heap space
        at java.base/java.util.stream.SpinedBuffer$OfInt.newArray(SpinedBuffer.java:750)
        at ...

The same happens if you use .limit(...)

Explanation here: Java 8, using .parallel in a stream causes OOM error

Similarly, don't use parallel if the stream is ordered and has many more elements than you want to process, e.g.

public static void main(String[] args) {
    // find the first element after skipping 100 elements, in parallel
    System.out.println(
            IntStream.range(1, 1000_000_000)
                    .parallel()
                    .skip(100)
                    .findFirst()
                    .getAsInt());
}

This may run much longer, because the parallel threads work on many number ranges instead of just the crucial first one (roughly 1-100), so it can take a very long time.

tkruse
2

Collection.parallelStream() is a great way to do work in parallel. However, you need to keep in mind that this effectively uses a common thread pool with only a few worker threads internally (the number of threads is equal to the number of CPU cores by default), see ForkJoinPool.commonPool(). If some of the pool's tasks are long-running, I/O-bound work, then other, potentially fast, parallelStream calls will get stuck waiting for free pool threads. This effectively requires fork-join tasks to be non-blocking and short or, in other words, CPU-bound. For a better understanding of the details, I strongly recommend carefully reading the java.util.concurrent.ForkJoinTask javadoc; here are some relevant quotes:

The efficiency of ForkJoinTasks stems from ... their main use as computational tasks calculating pure functions or operating on purely isolated objects.

Computations should ideally avoid synchronized methods or blocks, and should minimize other blocking synchronization

Subdividable tasks should also not perform blocking I/O

These indicate that the main purpose of parallelStream() tasks is short computations over isolated in-memory structures. I also recommend checking out the article Common parallel stream pitfalls.
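If you do end up with slow or blocking work in a parallel stream, one commonly cited workaround is to run the terminal operation inside a dedicated ForkJoinPool so the common pool is not tied up. The sketch below is my own illustration (pool size and work items are made up); note that routing a parallel stream through your own pool this way relies on implementation behaviour of the stream machinery, not a documented guarantee, and for genuinely I/O-bound work an explicit ExecutorService or CompletableFuture pipeline is often the cleaner choice.

    import java.util.Arrays;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ForkJoinPool;
    import java.util.stream.Collectors;

    public class DedicatedPoolSketch {
        public static void main(String[] args) throws InterruptedException, ExecutionException {
            List<String> items = Arrays.asList("a", "b", "c"); // stand-in for real work items

            ForkJoinPool ioPool = new ForkJoinPool(8); // arbitrary size for the slow tasks
            try {
                // The parallel stream's tasks run in ioPool instead of ForkJoinPool.commonPool(),
                // because the terminal operation is invoked from inside that pool.
                List<String> processed = ioPool.submit(() ->
                        items.parallelStream()
                             .map(String::toUpperCase) // stand-in for slow, blocking work
                             .collect(Collectors.toList())
                ).get();
                System.out.println("processed " + processed.size() + " items");
            } finally {
                ioPool.shutdown();
            }
        }
    }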

Roman Sinyakov