In which cases should Stream operations be stateful?

Question

In the javadoc for the stream package, at the end of the section Parallelism, I read:

Most stream operations accept parameters that describe user-specified behavior, which are often lambda expressions. To preserve correct behavior, these behavioral parameters must be non-interfering, and in most cases must be stateless.

I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?

I mean, I know it is possible, especially when using sequential streams, but the same javadoc clearly states:

Except for operations identified as explicitly nondeterministic, such as findAny(), whether a stream executes sequentially or in parallel should not change the result of the computation.

And also:

Note also that attempting to access mutable state from behavioral parameters presents you with a bad choice with respect to safety and performance; [...] The best approach is to avoid stateful behavioral parameters to stream operations entirely; there is usually a way to restructure the stream pipeline to avoid statefulness.

So, my question is: in which circumstances is it a good practice to use a stateful stream operation (and not for methods working by side-effect, such as forEach)?

A related question could be: why are there operations working by side effect, such as forEach? I always end up doing a good old for loop to avoid having side-effects in my lambda expression.

score 3 · Answer 1 · answered Oct 10 '15 at 20:43


Examples of stateful stream lambdas:

collect(Collector): The Collector is by definition stateful, since it has to collect all the elements in a collection (state).
forEach(Consumer): The Consumer is by definition stateful, well except if it's a black hole (no-op).
peek(Consumer): The Consumer is by definition stateful, because why peek if not to store it somewhere (e.g. log).

So, Collector and Consumer are two lambda interfaces that by definition are stateful.

All the others, e.g. Predicate, Function, UnaryOperator, BinaryOperator, and Comparator, should be stateless.


Andreas


2

`collect` didn't have to be designed as stateful; i.e. `accumulator` could have been a `(A,T)->T` function which can be stateless. It was probably due to some practical considerations, e.g. being able to write `List::add` as accumulator.... – ZhongYu Oct 10 '15 at 21:12
4

I don't think `peek(x->log(x))` would be considered "stateful" in this context. Inserting logging in the `func` of `map(func)` wouldn't be considered stateful either. The word `stateful` needs better definition here. – ZhongYu Oct 10 '15 at 21:16
2

You seem to confuse "being stateful" with "having side-effects". – a better oliver Oct 11 '15 at 09:21
@bayou.io: if you have a function of the form `(A,T)->T`, you can use [`reduce`](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#reduce-U-java.util.function.BiFunction-java.util.function.BinaryOperator-). But there are operations, where the requirement to always return a new object on each function evaluation can have an unacceptable performance impact and `collect` is exactly for supporting this use case. This is not so much to support `List::add` as that is already hidden within `Collectors.toList()`… – Holger Oct 12 '15 at 10:47
@Holger - `(A,T)->T` can be stateless, or stateful (mutating and returning the same T). It seems that for collector's accumulator, `(A,T)->T` would be more general/flexible than `(A,T)->void`. So it's interesting why the `void` version was chosen. – ZhongYu Oct 12 '15 at 11:32
@bayou.io: I’m not sure, what you are trying to tell me. `reduce` *requires* the functions *not* to mutate the incoming objects. If you can fulfill this requirement, there is no need for `collect`. In other words supporting *mutable reduction* is the *only* purpose of `collect`. That’s why the method signatures clearly reflect this. They enforce the programmer to be aware of whether they perform a “clean” reduction or a mutable reduction. Behind the scenes, the `Collector` interface abstracts both, to allow code sharing inside the implementation. – Holger Oct 12 '15 at 11:42
@Holger - `collect` can be viewed as expansion of `reduce`; I'm not convinced `collect` should be constrained so that it cannot cover `reduce`. It is pretty odd that an API is purposefully designed to exclude stateless/functional way of programming. – ZhongYu Oct 12 '15 at 14:05
@bayou.io: why should `collect` cover the use case of `reduce` when there is already the method `reduce` which covers the use case of `reduce`? I don’t get your point. Why do we have multiple methods at all? We could provide one method covering all operations in one… – Holger Oct 12 '15 at 14:07
Collector.combiner can be either stateful or stateless; the same could have been true for the accumulator. However, in [Stream.collect(..., combiner)](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Stream.html#collect-java.util.function.Supplier-java.util.function.BiConsumer-java.util.function.BiConsumer-), `combiner` is specifically limited to stateful ones; and the only reason appears to be the ability to use `ArrayList::addAll` etc. as the combiner argument. – ZhongYu Oct 12 '15 at 14:07
Why can't `reduce` be a special case of `collect`? :) That's what I'd do. It's common that an API allows mutability; but it's very odd that an API requires mutability when it doesn't have to. – ZhongYu Oct 12 '15 at 14:11
@bayou.io: you are right, the main reason to make the combiner a biconsumer is to allow using methods like `List.addAll` or, in other words, any method of the form `instance.methodName(arg)` which will modify `instance` according to `arg`, which is not uncommon. Besides that, as already said, internally `reduce` and `collect` *are* two variants of the same operation. See [`Collector.of(…)`](https://docs.oracle.com/javase/8/docs/api/java/util/stream/Collector.html#of-java.util.function.Supplier-java.util.function.BiConsumer-java.util.function.BinaryOperator-java.util.stream.Collector.Characteristics...-)… – Holger Oct 12 '15 at 14:14
btw, do you happen to know why the Collector interface is designed to have methods like `Supplier supplier()`, instead of just `A supply()`? That is quite peculiar. If everybody programs like that, we won't have `Object.toString()->String`, we'll have `Object.stringer()->Supplier` :( – ZhongYu Oct 12 '15 at 14:19
@bayou.io Notice that `Collector` is actually a factory interface, even though they never call it that. I believe they wanted the ability to use the `Collector.of()` static method to create collectors, reusing existing methods for all 4 facets: `supplier`, `accumulator`, `combiner`, and `finisher`. Otherwise you'd have to implement a new `Collector` for every "collection". – Andreas Oct 12 '15 at 16:40
So, in a `FooFactory`, we should not have method `Foo makeFoo()`; instead, the method should look like `Supplier fooMaker()`? :) – ZhongYu Oct 12 '15 at 16:42
@bayou.io No, because `FooFactory` is a factory designed to create `Foo` *objects*. `Collector` is not a factory for creating collections, but a factory of *functions* for creating collections. The 4 factory methods return a function, and the 4 functions work together to create a collection. – Andreas Oct 12 '15 at 16:50
well, methods are already functions... any interface is already a set of functions. – ZhongYu Oct 12 '15 at 16:53
@bayou.io He means functions that are functional interfaces, from `java.util.function` ;) Interesting discussion. So I quite well understand now why `Collector` is stateful, but I don't see why `forEach` couldn't be stateless. – FBB Oct 13 '15 at 23:52
@FBB - I'll put my understanding in an answer shortly. – ZhongYu Oct 14 '15 at 00:24
@Holger, btw I don't see where it's specified that `reduce` cannot modify the previous accumulator. – Tagir Valeev Oct 14 '15 at 08:11
@bayou.io, actually the very first version of [Collector](http://hg.openjdk.java.net/jdk8/jdk8/jdk/file/2a78d8f1fec1/src/share/classes/java/util/stream/Collector.java) interface was intended to support `reduce` scenario (there was `STRICTLY_MUTATIVE` characteristic which signals that accumulator function always mutates the existing accumulator instead of creating one). Later this was removed. – Tagir Valeev Oct 14 '15 at 08:20
1

@Tagir Valeev: the `accumulator` parameter of `reduce` is a *function*. There is no “previous accumulator”. If you mean something like a mutable container instead, well, it doesn’t need to be specified, as there is no such thing in the `reduce` signature. If you are talking about the three-arg form of `reduce`, the first argument is an *identity value*, thus inherently forbids mutation. Besides that, I think [this documentation](http://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#Reduction) does not leave room for doubts. Compare with “Mutable reduction” beneath it… – Holger Oct 14 '15 at 08:28
@bayou.io: the documentation says “Collectors are designed to be *composed*” which gives a hint for a possible reason: the supplier (as well as all other functions) provided by a collector doesn’t need to be created by that collector. Thus, instead of traversing a deep call chain on each invocation, the function is acquired only once and might be flat, despite being the result of a more complex construction. Granted, this matters more for the accumulator function rather than the supplier, but the `interface` is consistent in this regard. – Holger Oct 14 '15 at 08:50
@Holger - Such internal performance concern does not seem significant enough to influence API design. Can't JVM flatten simple call forwarding anyway? Maybe the designer really likes `supplier()->Supplier` over `supply()->A` from his aesthetic view. – ZhongYu Oct 14 '15 at 12:19
@bayou.io: the JVM can and does, but only to a certain limit and there are several occasions where the developers decided not to rely on the JVM’s optimizations. That doesn’t drive all API designs, but `Collector` is an interface that is not expected to be implemented by application developers (too often). Instead, if you are not using prebuilt collectors, it’s instantiated via `Collector.of(…)` or ad-hoc by calling the three-arg `collect` method on the `Stream`. The whole existence of `IntStream`, `LongStream` and `DoubleStream` tells a story of unaesthetic APIs created only for performance… – Holger Oct 14 '15 at 12:26
@Holger - that's a possibility, but I won't fully buy it without seeing some benchmark. – ZhongYu Oct 14 '15 at 12:33
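
The `reduce` versus `collect` distinction debated in the comments above can be made concrete with a small sketch. It is an illustration only, not code from any of the commenters; the class and method names are made up for the example. The first method is a "clean" reduction whose accumulator never mutates its arguments, the second is a mutable reduction using the three-argument `collect`.

```java
import java.util.Arrays;
import java.util.List;

// Hypothetical demo class; the names are illustrative only.
final class ReduceVsCollect {

    // "Clean" reduction: the accumulator function never mutates its
    // arguments; every step returns a new String.
    static String joinWithReduce(List<String> words) {
        return words.stream().reduce("", (a, b) -> a + b);
    }

    // Mutable reduction: each container comes from the supplier and is
    // mutated in place by the accumulator and the combiner.
    static String joinWithCollect(List<String> words) {
        return words.stream()
                .collect(StringBuilder::new, StringBuilder::append, StringBuilder::append)
                .toString();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("a", "b", "c");
        System.out.println(joinWithReduce(words));
        System.out.println(joinWithCollect(words));
    }
}
```

Both methods produce the same string; the difference is that `reduce` allocates a new `String` at every step, which is the performance cost Holger mentions as the motivation for `collect`.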

Denis Bazhenov · Answer 2 · 2015-10-14T08:00:48.797

I have a hard time understanding this "in most cases". In which cases is it acceptable/desirable to have a stateful stream operation?

Suppose the following scenario. You have a Stream<String> and you need to list the items in natural order, prefixing each one with its order number. So, for example, the input is: Banana, Apple and Grape. The output should be:

1. Apple
2. Banana
3. Grape

How do you solve this task with the Java Stream API? Pretty easily:

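import static java.util.Arrays.asList;

import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;
import java.util.stream.Collectors;

List<String> f = asList("Banana", "Apple", "Grape");

// Shared counter: this is what makes the map() lambda stateful.
AtomicInteger number = new AtomicInteger(0);
String result = f.stream()
  .sorted()
  .sequential() // the numbering is only correct when elements are mapped in order
  .map(i -> String.format("%d. %s", number.incrementAndGet(), i))
  .collect(Collectors.joining("\n"));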

Now if you look at this pipeline you'll see 3 stateful operations:

sorted() – stateful by definition. See the documentation of Stream.sorted():

This is a stateful intermediate operation
map() – by itself could be stateless or not, but in this case it is not. To label positions you need to keep track of how many items have already been labeled;
collect() – is a mutable reduction operation (from the docs of Stream.collect()). Mutable operations are stateful by definition, because they change (mutate) shared state.
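
For comparison, here is one possible way to get the same numbered output without a stateful `map()` lambda. This is a sketch added for illustration and not part of the original answer; the class and method names are hypothetical. It assumes the sorted elements can be materialized into a list first, so that each label is derived from the element's index rather than from a shared counter.

```java
import java.util.Arrays;
import java.util.List;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

// Hypothetical demo class; the names are illustrative only.
final class NumberedList {

    // Sort first, then derive each label from the element's index
    // instead of from a shared counter.
    static String numbered(List<String> items) {
        List<String> sorted = items.stream().sorted().collect(Collectors.toList());
        return IntStream.range(0, sorted.size())
                .parallel() // safe here: the lambda only reads shared data
                .mapToObj(i -> String.format("%d. %s", i + 1, sorted.get(i)))
                .collect(Collectors.joining("\n"));
    }

    public static void main(String[] args) {
        System.out.println(numbered(Arrays.asList("Banana", "Apple", "Grape")));
    }
}
```

Because the lambda depends only on its own index, the result does not change when the stream runs in parallel, which is why no `sequential()` call is needed in this variant.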

There is some controversy about why sorted() is stateful. From the Stream API documentation:

Stateless operations, such as filter and map, retain no state from previously seen element when processing a new element -- each element can be processed independently of operations on other elements. Stateful operations, such as distinct and sorted, may incorporate state from previously seen elements when processing new elements.

So when applying the terms stateful/stateless to the Stream API, we're talking more about the function that processes an element of a stream, and not about the function that processes the stream as a whole.

Also note that there is some confusion between the terms stateless and deterministic. They are not the same.

A deterministic function provides the same result given the same arguments.

A stateless function retains no state from previous calls.

Those are different definitions, and in the general case they don't depend on each other. Determinism is about a function's result value, while statelessness is about its implementation.
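
Under the two definitions given in this answer (which other participants in the thread dispute), the distinction can be sketched as follows. The class and field names are hypothetical and exist only for this illustration.

```java
import java.util.concurrent.atomic.AtomicInteger;
import java.util.function.Function;

// Hypothetical demo class; the names are illustrative only.
final class StatelessVsDeterministic {

    // Stateless and deterministic: the result depends only on the argument.
    static final Function<String, Integer> LENGTH = String::length;

    // Stateless but not deterministic (by this answer's definitions): it
    // keeps nothing between calls, yet its result comes from the clock.
    static final Function<String, Long> CLOCK = s -> System.nanoTime();

    // Stateful: each call updates a counter that later calls observe.
    static Function<String, Integer> newCounter() {
        AtomicInteger calls = new AtomicInteger();
        return s -> calls.incrementAndGet();
    }

    public static void main(String[] args) {
        Function<String, Integer> counter = newCounter();
        System.out.println(LENGTH.apply("abc") + " " + LENGTH.apply("abc"));
        System.out.println(counter.apply("abc") + " " + counter.apply("abc"));
        System.out.println(CLOCK.apply("abc"));
    }
}
```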

From user's point of view, `sorted()` is not stateful. It does not depend any state other than the input stream, and it makes no state change that is visible to the user. It being labeled "stateful" is from implementer's point of view. — ZhongYu, Oct 14 '15 at 02:21
although, all intermediary operations render the original stream unusable; they are stateful in that sense, if that is a problem to the programmer. — ZhongYu, Oct 14 '15 at 02:24
"retain no state from previous calls" - if `foo()` depends on state mutated by `bar()`, and vice versa, are they still stateless? :) The meaning of a word really depends on usage and context. We throw around words like `stateless` casually, but it's ok because in each context we know what the word tries to categorize. Using "stateless" for `sorted()` is apparently a poor choice of word, but it's ok; it's more for labeling two kinds of operations. Author might as well call them "taste-ful/less" operations, or "black-white", and we still get the meaning. — ZhongYu, Oct 14 '15 at 12:25

Tagir Valeev · Answer 3 · 2015-10-11T02:23:04.813

0

When in doubt, simply check the documentation of the specific operation. Examples:

Stream.map mapper parameter:

mapper - a non-interfering, stateless function to apply to each element

Here the documentation explicitly says that the function must be stateless.
Stream.forEach action parameter:

action - a non-interfering action to perform on the elements

Here it's not specified that the action is stateless, thus it can be stateful.

In general, this is always stated explicitly in each method's documentation.
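
As an illustration of the difference between the two parameter descriptions, the sketch below computes the same total twice: once with a stateful `forEach` action, and once with a stateless mapper. The class and method names are hypothetical, and `LongAdder` is just one thread-safe accumulator that could be used here.

```java
import java.util.Arrays;
import java.util.List;
import java.util.concurrent.atomic.LongAdder;

// Hypothetical demo class; the names are illustrative only.
final class ForEachState {

    // Stateful action: forEach works by side effect, so the action updates
    // an accumulator; LongAdder keeps the updates safe in parallel.
    static long totalLengthWithForEach(List<String> words) {
        LongAdder total = new LongAdder();
        words.parallelStream().forEach(w -> total.add(w.length()));
        return total.sum();
    }

    // Stateless alternative: the mapper only reads its argument and the
    // stream performs the summation itself.
    static long totalLengthWithSum(List<String> words) {
        return words.parallelStream().mapToLong(String::length).sum();
    }

    public static void main(String[] args) {
        List<String> words = Arrays.asList("Banana", "Apple", "Grape");
        System.out.println(totalLengthWithForEach(words));
        System.out.println(totalLengthWithSum(words));
    }
}
```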

edited Oct 11 '15 at 02:23

answered Oct 11 '15 at 01:45

Tagir Valeev


Hmm, `Stream.map` *could* be stateful, and `Stream.forEach` *could* be stateless, so it doesn't really answer my question, that is, in which cases it is a good practice to use stateful operations, and why? – FBB Oct 13 '15 at 23:54

ZhongYu · Answer 4 · 2015-10-14T00:55:44.623

0

A stateless function returns the same output for the same inputs, "no matter what".

It's easy to create non-stateless functions in an imperative language like Java. e.g.

    func = input -> currentTime();

If we do stream.map(func) with a stateful func, the resulting stream will depend on how func is invoked at runtime; the behavior of the application will be hard to understand (but not that hard).

If func is stateless, stream.map(func) will always produce the same stream, no matter how map is implemented and executed. This is nice and desirable.

Note that "no matter what" implies that a stateless function must be thread-safe.

If a function returns void, isn't it always stateless? Well... there's another connotation of stateless - invoking a stateless function should not have side effects that are "important" to the application.

If func has no "important" side effects, it's safe to invoke func arbitrarily. For example, stream.map(func) can safely invoke func multiple times even on the same element. (But don't worry, Stream is never gonna do that).

What is an "important" side effect? That is very subjective.

At the very least, invoking func will cost some CPU time, which is not exactly free. This might be concerning for performance critical applications; or on expensive platforms (cough AWS).

If func logs something to the hard disk, it may or may not be an "important" side effect. (It too costs $$)

If func queries an external service that costs dearly, it is very concerning, it can bankrupt you.

Now, forget about money. Purely from an application logic point of view, func could cause mutation to some state that the application depends on; even if func returns the same output for the same inputs, it still cannot be considered "stateless". For example, if in stream.map(func), func adds each element to a list, and later the application uses the list, the resulting list will depend on how func is invoked at runtime. This is frowned upon by functional programmers.
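
The list example in the previous paragraph might look like the sketch below. It is an illustration with hypothetical names, not code from the answer. The side list is wrapped with `Collections.synchronizedList` only so that concurrent `add` calls are safe; its element order still depends on how the parallel stream schedules the calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;
import java.util.stream.Collectors;

// Hypothetical demo class; the names are illustrative only.
final class SideEffectInMap {

    // func mutates state the caller later reads: the returned list holds
    // every element, but in whatever order the threads happened to run.
    static List<String> seenBySideEffect(List<String> items) {
        List<String> seen = Collections.synchronizedList(new ArrayList<>());
        items.parallelStream()
                .map(s -> { seen.add(s); return s.toUpperCase(); })
                .collect(Collectors.toList()); // result discarded on purpose
        return seen;
    }

    // Stateless func: the collected list follows the encounter order.
    static List<String> upperCased(List<String> items) {
        return items.parallelStream()
                .map(String::toUpperCase)
                .collect(Collectors.toList());
    }

    public static void main(String[] args) {
        List<String> items = Arrays.asList("Banana", "Apple", "Grape");
        System.out.println(seenBySideEffect(items)); // order may vary between runs
        System.out.println(upperCased(items));
    }
}
```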

If we do stream.forEach( e->log(e) ), is it stateless? We can consider it stateless if

we don't care about the cost of log
log() can be invoked concurrently
we don't care about the order of log entries
log entries have no impact on this application's logic

edited Oct 14 '15 at 00:55

answered Oct 14 '15 at 00:40

ZhongYu


You're confusing the terms stateless and deterministic. `currentTimeMillis()` is not deterministic, but stateless. – Denis Bazhenov Oct 14 '15 at 01:40
@DenisBazhenov - replace the example with `input->seq++` – ZhongYu Oct 14 '15 at 01:45
that one is indeed stateful. – Denis Bazhenov Oct 14 '15 at 01:47
now - is clock querying stateless? hmm... Look at the application as a whole, it contains a subcomponent (the clock) that updates a counter... I wouldn't say that is functional-programming... – ZhongYu Oct 14 '15 at 01:47
Stateful algorithms keep track of previous interactions with clients. Counters are a perfect example of stateful algorithms, but clocks are not. Clocks store nothing about previous interactions. – Denis Bazhenov Oct 14 '15 at 01:50
clock has a state :) The terms of state-ful/less are indeed used to mean different things in different contexts. We may start from the most strict sense of stateless, and an application can express nothing but universal and eternal truth. But that is useless, who needs that. From there, we will need some state somewhere, and it gets subjective. – ZhongYu Oct 14 '15 at 01:59
Could you provide your definition of statefulness, please? – Denis Bazhenov Oct 14 '15 at 02:13
in the most strict sense, a function is stateful if it depends on anything other than the inputs. – ZhongYu Oct 14 '15 at 02:18
in this sense your definition is equal to the definition of the deterministic function (https://en.wikipedia.org/wiki/Deterministic_algorithm). We do not need two for the job, so just use word _deterministic_ to avoid confusion. And the question was about _statelessness_ which is quite different characteristic. – Denis Bazhenov Oct 14 '15 at 07:45
it can be deterministic and stateful if it mutates observable states. – ZhongYu Oct 14 '15 at 12:28
If it mutates shared state it has side-effects (https://en.wikipedia.org/wiki/Side_effect_(computer_science)). I'm not sure we even need such term as "stateful". It seems to get things more complicated. – Denis Bazhenov Oct 16 '15 at 00:36
By the way, you will find no definition of statefulness/statelessness on Wikipedia. I'm not saying it proves any point, but doesn't it seem strange? – Denis Bazhenov Oct 16 '15 at 00:39
