4

I'm confused by the way Java streams work, particularly as regard to short-circuiting. For an example of what confuses me, I cooked up the following example:

List<Integer> list = Arrays.asList(1,2,3,4,5,6,7,8,9,10);
Optional<Integer> res = list.stream()
        .map(x -> {
            System.out.println("first map: " + x);
            return 2*x;
        })
        .sorted((a,b)-> {
            System.out.println("sorting " + a + " : " + b);
            return a - b;
        })
        .map(x -> {
            System.out.println("second map: " +  x);
            return 2*x;
        })
        .findAny();
System.out.println("found " + res.get());

with output

first map: 1
first map: 2
first map: 3
first map: 4
first map: 5
first map: 6
first map: 7
first map: 8
first map: 9
first map: 10
sorting 4 : 2
sorting 6 : 4
sorting 8 : 6
sorting 10 : 8
sorting 12 : 10
sorting 14 : 12
sorting 16 : 14
sorting 18 : 16
sorting 20 : 18
second map: 2
found 4

When executed, this code demonstrates that calling sorted in the middle forces the first map to apply to every element in the stream. However, the second map does not get applied to every element in the stream since findAny short-circuits it.

So basically my question is: what are the rules here? Why is Java smart enough to know it doesn't need to call the second map on every element of the stream but not smart enough to know that findAny at the end doesn't require it to actually sort anything.

I tried "reading the manual" on this, and it just isn't clear to me.

John Kugelman
  • 349,597
  • 67
  • 533
  • 578
  • 5
    It can't sort the stream without checking the value of every element. But it can find one element just by getting the value of one element. – khelwood Jan 19 '22 at 14:31
  • See https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#StreamOps – khelwood Jan 19 '22 at 14:32
  • But why doesn't it know it doesn't need to sort the stream? It only has to pull one element –  Jan 19 '22 at 14:36
  • 2
    Presumably the implementers didn't foresee a use case of someone explicitly specifying that the stream would be sorted before using `findAny()` and not wanting the stream to be sorted. You are trying to get something from the output of `sorted`; and in order to allow that, everything up to the `sorted()` must be executed, which requires evaluating every element. – khelwood Jan 19 '22 at 14:54
  • Even if you use `findAny`, it does not mean that you do not want to sort *all* elements before. Imagine you just want to find the smallest element that matches a condition. – Arnaud Denoyelle Jan 19 '22 at 15:01
  • 3
    [From the Javadocs](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/stream/package-summary.html) of the `java.util.stream` package: *"Stateful operations may need to process the entire input before producing a result."* – MC Emperor Jan 19 '22 at 15:28
  • @ArnaudDenoyelle you should use `findFirst()` for that use case. I agree whith khelwood though: the implementers didn’t care to optimize for the case where you ask for sorting but don’t care about the sorting result. – Didier L Jan 19 '22 at 15:31
  • I suppose what bothers me about this is I imagine the computation going upward from the terminal operation to the stream above it, and going up again as needed. Evidently, that is not what happens, and it strikes me as anti-intuitive. But I grant that my thinking on this may not be entirely clear. That's why I asked the question in the first place. Thanks. –  Jan 19 '22 at 16:25
  • 1
    The rule is you should use pure functions with streams, so that it doesn't matter which of your functions are called in what order - because the stream API generally doesn't make guarantees about that. – kaya3 Jan 19 '22 at 20:14
  • 3
    @Adam, that is in fact a pretty good description of what happens. `findAny()` asks for one element from its source stream. To get that, the source stream (produced by `map()`) has to get one from its own source stream, which is the result of `sorted()`. That stream doesn't provide *any* elements until it has sorted its whole source stream. None of the streams know anything about the context of any downstream request. – John Bollinger Jan 19 '22 at 20:56
  • 1
    In particular, the non-terminal operations are not modifying a single stream in-place. Rather, they each produce a new stream. – John Bollinger Jan 19 '22 at 21:01
  • 2
    @Adam you imagine correctly. It goes “upward from the terminal operation”, the question is, how much information does it carry upwards and to what degree do the operations above react on it. That’s just implementation specific. I hope [my answer](https://stackoverflow.com/a/70783950/2711488) makes it clear, especially the point that the implementation may change. – Holger Jan 20 '22 at 09:49
  • @JohnBollinger That helps a lot actually. Thank you for commenting. –  Jan 20 '22 at 15:21
  • I found both answers useful to me, but since I could only choose one I chose the one that made the most sense to me. –  Jan 20 '22 at 15:51

2 Answers2

3

That happens because streams are processing elements from the source lazily one at a time.

Each operation occurs only when it's needed.

In the case of sorting, stream dumps all the data from the source to memory because there's no way to sort elements peeking then one by one.

map --> sorted --> map --> findAny

The First-time map operation gets applied to all elements because it precedes the sorting operation.

After sorting is done, elements will be processed one at a time.

Terminal operation findAny is a short-circuit operation. And hence only the very first of all the elements will be peeked by the terminal operation, the second map operation needs to be applied only once.


Why is Java smart enough to know it doesn't need to call the second map on every element of the stream but not smart enough to know that findAny at the end doesn't require it to actually sort anything

Well, I guess because until now findAny() in the case of single-threaded execution still behaves exactly like findFirst(), which will require to do sorting in order to find the correct result.

You may try to play with sequential streams and find out that findAny() and findFirst() yield the same result.

I can't give you the answer to why it behaves in such away.

Java-doc says with caution about findAny()

The behavior of this operation is explicitly nondeterministic

Only if stream is being processed in parallel behavior of findAny() differs from findFirst().

findAny() could return an element that is peeked by any thread, while findFirst() guarantees to respect the initial order while providing the result.


https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/stream/Stream.html

Alexander Ivanchenko
  • 25,667
  • 5
  • 22
  • 46
  • Note that "non-deterministic" doesn't mean "unpredictable" in this context, it means the spec doesn't guarantee the behaviour exactly, so different implementations may have different behaviour. Each implementation can still be predictable. – kaya3 Jan 19 '22 at 20:17
  • @kaya3 thanks for the clarification – Alexander Ivanchenko Jan 19 '22 at 20:22
  • About how smart (or not) Java is: I would not characterize it at all in terms of threads or `findFirst()`. It's simply that to get any element from the second `map()` stream, that stream has to get at least one element from the `sorted()` stream, and the latter does not provide any elements at all without sorting its whole source stream. It all comes back to the short-circuiting: that's all Java is doing, not any analysis of the specific semantics of the various streams. It can't and doesn't cut any streams out of the chain. – John Bollinger Jan 19 '22 at 20:50
  • @JohnBollinger thanks for your explanation, I've spent some time meditating about how it fits in my understanding of the 'java universe'. Now the overall picture seems clear: the only optimization that is possible with streams boils down to the decision: to execute the pipeline or not. Like with count() operation, which in the case if none of the intermediate operations changes the number of elements, simply interrogates the source of data instead of triggering the execution of the pipeline. But if the pipeline has to be processed there is no means to break the chain of operations. – Alexander Ivanchenko Jan 19 '22 at 21:42
2

This is indeed not obvious because the actual reason is rather historical. The first release version of Java 8 had a Stream implementation which did consider the unordered nature of findAny, at least in some cases.

Not in the case of a skippable sorted but in more cases than correct. This has been discussed in

In the latter Q&A we find a comment from Brian Goetz, one of the architects:

After some analysis we chose to back out back-propagation entirely. The only place where this pays off is optimizing away a sort; if you have a pipeline that sorts into an unordered terminal op, that's probably a user error anyway.

So the current state of the reference implementation is that the unordered nature of the terminal operation is not used for optimizing preceding intermediate operations.

This doesn’t imply that such optimization wouldn’t be valid; the example of skipping a sort when the subsequent operations do not care about the order is explicitly mentioned as valid, though Brian Goetz suggests that such a scenario might be a mistake on the developer’s side anyway.

I have a slightly different point of view, as I think it would be a useful scenario if you have Stream returning method, which may return a sorted Stream, while operations chained by the caller determine whether this sorting is relevant for the final result or not.

But whether such an optimization happens or not does not affect the correctness of the result. Note that this does not only apply to unordered operations.

When you have

Optional<Integer> res = IntStream.rangeClosed(1, 10)
    .mapToObj(x -> {
        System.out.println("first map: " + x);
        return 2*x;
    })
    .sorted((a,b)-> {
        System.out.println("sorting " + a + " : " + b);
        return Integer.compare(a, b);
    })
    .map(x -> {
        System.out.println("second map: " +  x);
        return 2*x;
    })
    .findFirst();

it would be perfectly valid to transform the operation to (the equivalent of)

Optional<Integer> res = IntStream.rangeClosed(1, 10)
    .mapToObj(x -> {
        System.out.println("first map: " + x);
        return 2*x;
    })
    .min((a,b)-> {
        System.out.println("sorting " + a + " : " + b);
        return Integer.compare(a, b);
    })
    .map(x -> {
        System.out.println("second map: " +  x);
        return 2*x;
    });

which would be more efficient and still return the correct result for the terminal operation while the evaluation order of the first mapping function and the comparator is different.

But the correct result is all that matters. You should never assume a specific evaluation order at all. You don’t find a specification of the evaluation order in the manual, because it has been left out intentionally. The presence or absence of optimizations can change at any time. If you want a dramatic example, replace your findAny() with count() and compare your example’s behavior in Java 8 and Java 9 (or newer).

// run with Java 8 and Java 9+ and compare...
long count = IntStream.rangeClosed(1, 10)
    .mapToObj(x -> {
        System.out.println("first map: " + x);
        return 2*x;
    })
    .sorted((a,b)-> {
        System.out.println("sorting " + a + " : " + b);
        return Integer.compare(a, b);
    })
    .map(x -> {
        System.out.println("second map: " +  x);
        return 2*x;
    })
    .count();
System.out.println("found " + count + " elements");
Holger
  • 285,553
  • 42
  • 434
  • 765
  • 1
    *"I think it would be a useful scenario...the caller determine whether this sorting is relevant"* – Good point. The particular user does not always have control over the entire stream. – MC Emperor Jan 20 '22 at 11:44