I was reading about statelessness and came across this in doc:
Stream pipeline results may be nondeterministic or incorrect if the behavioral parameters to the stream operations are stateful. A stateful lambda (or other object implementing the appropriate functional interface) is one whose result depends on any state which might change during the execution of the stream pipeline.
Now if I have the a list of string (strList
say) and then trying to remove duplicate strings from it using parallel streams in the following way:
List<String> resultOne = strList.parallelStream().distinct().collect(Collectors.toList());
or in case we want case insensitive:
List<String> result2 = strList.parallelStream().map(String::toLowerCase)
.distinct().collect(Collectors.toList());
Can this code have any problem as parallel streams will split the input and distinct in one chunk does not necessarily mean distinct in the whole input?
EDIT (Quick summary of the answers below)
The distinct
is a stateful operation and in case of stateful intermediate operations parallel streams may require multiple passes or substantial buffering overheads. Also distinct
can be implemented more efficiently if ordering of elements is not relevant.
Also as per doc:
For ordered streams, the selection of distinct elements is stable (for duplicated elements, the element appearing first in the encounter order is preserved.) For unordered streams, no stability guarantees are made.
But in case of ordered stream running in parallel distinct may be unstable - means it will keep an arbitrary element in case of duplicates and not necessarily the first one as expected from distinct
otherwise.
From the link:
Internally, the distinct() operation keeps a Set that contains elements that have been seen previously, but it’s buried inside the operation and we can’t get to it from application code.
So in case of parallel streams it would probably consume the entire stream or may use CHM (sth like ConcurrentHashMap.newKeySet()
). And for ordered ones most likely it would be using LinkedHashSet
or similar contruct.