3

I have a list of objects from the database and I want to filter this list using the filter() method of the Stream class. New objects will be added to the database continuously, so the list could potentially become very large, possibly thousands of objects. I want to use a parallelStream to speed up the filtering, but I was wondering how large the list should approximately be for a parallelStream to be beneficial. I've read this thread about it: Should I always use a parallel stream when possible? And in that thread they agree that the dataset should be really large before a parallel stream gives any benefit. But how large is large? Say I have 200 records stored in my database and I retrieve them all for filtering, is using a parallelStream justified in this case? If not, how large should the dataset be? 1,000? 2,000 perhaps? I'd love to know. Thank you.

Maurice
  • 6,698
  • 9
  • 47
  • 104
  • Probably in the tens to hundreds of thousands, if not even millions, but it really depends on the operations you want to perform – Lino Jul 23 '18 at 13:04
  • Check this question out - the accepted answer should help: https://stackoverflow.com/questions/20375176/should-i-always-use-a-parallel-stream-when-possible – Ascalonian Jul 23 '18 at 13:04
  • 1
    The best answer is: it depends. You must test it. I found that the performance depends on what you are trying to do and on the computer (and its processors) it is running on. – Ralf Renz Jul 23 '18 at 13:05
  • [it might be helpful](https://stackoverflow.com/questions/20375176/should-i-always-use-a-parallel-stream-when-possible) – Andrew Tobilko Jul 23 '18 at 13:06
  • The formula at the bottom of [this answer](https://stackoverflow.com/a/39066952/5515060) may help – Lino Jul 23 '18 at 13:06
  • @Lino it's just for filtering. No other actions – Maurice Jul 23 '18 at 13:06
  • 1
    In the second (and third) answer of the question you linked by Brian Goetz he talks about the NQ model, where N is the amount and Q the computation per element. So as mentioned by @Lino it really depends on what each iteration does. For low-Q computations it can indeed be in the millions, but for high-Q computations it could only be a couple thousand. Unfortunately there isn't a clear answer. It depends on what you want to do, and after that it's basically doing different kind of performance tests. – Kevin Cruijssen Jul 23 '18 at 13:07
  • Well, it's just for filtering, so that's not really CPU intensive, I suppose. Guess I'll go with the sequential stream then – Maurice Jul 23 '18 at 13:08

2 Answers

4

According to this, and depending on the operation, it would require at least 10_000; but that is not a number of elements, it is N * Q, where N = the number of elements and Q = the cost per element.
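As a rough illustration of that heuristic: with a cheap predicate (Q close to 1) you would need on the order of 10,000 elements before parallelism can pay off, while with an expensive predicate (say Q around 100) a few hundred elements might already cross that threshold.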

But this is only a general formula to check against; without measuring, this is close to impossible to say (read: guess); proper tests will prove you right or wrong.

For simple operations, it is almost never the case that you would actually need parallel processing for the purpose of speed-up.

Another thing to mention is that this heavily depends on the source and how easily it can be split. Anything array-based or index-based is easy (and fast) to split, but a Queue or lines from a File are not, so you may well lose more time splitting than computing, unless, of course, there are enough elements to make up for it. And "enough" is something you actually have to measure.
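If you want a feel for where the crossover sits on your machine, a crude sketch like the one below can at least show the trend (the list size, predicate and timing approach are all made up for illustration; a proper measurement would use a benchmark harness such as JMH):

```java
import java.util.List;
import java.util.concurrent.ThreadLocalRandom;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class FilterTiming {

    public static void main(String[] args) {
        // Hypothetical data set; adjust the size to see where parallel starts to pay off.
        List<Integer> data = IntStream.range(0, 200_000)
                .map(i -> ThreadLocalRandom.current().nextInt(1_000))
                .boxed()
                .collect(Collectors.toList());

        // Crude warm-up plus timing; a real measurement should use a proper harness.
        for (int run = 0; run < 5; run++) {
            long t1 = System.nanoTime();
            List<Integer> sequential = data.stream()
                    .filter(n -> n % 2 == 0)
                    .collect(Collectors.toList());
            long t2 = System.nanoTime();
            List<Integer> parallel = data.parallelStream()
                    .filter(n -> n % 2 == 0)
                    .collect(Collectors.toList());
            long t3 = System.nanoTime();

            System.out.printf("run %d: sequential %d µs, parallel %d µs (%d / %d matches)%n",
                    run, (t2 - t1) / 1_000, (t3 - t2) / 1_000,
                    sequential.size(), parallel.size());
        }
    }
}
```

For a trivial filter like this, the sequential version tends to win until the list gets fairly large, which matches the point about splitting cost above.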

Eugene
  • 117,005
  • 15
  • 201
  • 306
0

From 'Modern Java in Action': "Although it may seem odd at first, often the fastest way to filter a collection...is to convert it to a stream, process it in parallel, and then convert it back to a list"
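For completeness, a minimal sketch of that pattern (the element type and predicate are made up for illustration):

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;

public class ParallelFilterExample {

    public static void main(String[] args) {
        // Hypothetical stand-in for the records loaded from the database.
        List<String> names = Arrays.asList("Alice", "Bob", "Charlie", "Dave", "Eve");

        // Collection -> parallel stream -> filter -> collect back into a list,
        // which is the pattern the quote describes.
        List<String> longNames = names.parallelStream()
                .filter(name -> name.length() > 3)
                .collect(Collectors.toList());

        System.out.println(longNames); // [Alice, Charlie, Dave]
    }
}
```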

HellishHeat
  • 2,280
  • 4
  • 31
  • 37