
I have some code as follows:

Map<RiskFactor, RiskFactorChannelData> updateMap =
    updates.stream()
        .filter(this::updatedValueIsNotNull) // Remove null updated values
        .collect(Collectors.toMap(
            u -> u.getUpdatedValue().getKey(), // then merge into a map of key -> value
            Update::getUpdatedValue,
            (a, b) -> b)); // If two values have the same key, take the second value

Specifically I want to take the values from the list and put them into the map. That all works perfectly. My concern though is with ordering.

For example if the list has:

a1, b1, a2

How do I ensure that the final map contains:

a->a2
b->b1

Instead of

a->a1
b->b1

The incoming list is ordered, and stream().filter() should have maintained that order, but I can't see anything in the documentation of Collectors.toMap about the ordering of its inputs.

Is this safe in the general case, or have I just been lucky in my test cases so far? Is this JVM-dependent and at risk of changing in the future?

This is very simple to guarantee if I just write a for loop, but the "fuzziness" of potential stream behavior is making me concerned.

I'm not planning to use parallel for this; I'm purely seeking to understand the behavior in the case of a sequential, non-parallel stream that ends in toMap.
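
For reference, this is roughly the loop I have in mind (a sketch, using the same types and helper method as the stream version above):

Map<RiskFactor, RiskFactorChannelData> updateMap = new HashMap<>();
for (Update u : updates) {
    if (updatedValueIsNotNull(u)) {
        // put() overwrites, so the last update for each key wins by construction
        updateMap.put(u.getUpdatedValue().getKey(), u.getUpdatedValue());
    }
}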

Tim B
  • I don't know if it's safe, but if it's not you could use `Collectors.groupingBy` upstream of your `Collectors.toMap` – Aaron Feb 08 '17 at 15:45
  • Possible duplicate: http://stackoverflow.com/questions/30258566/java-stream-map-and-collect-order-of-resulting-container – assylias Feb 08 '17 at 15:45
  • Or maybe better: http://stackoverflow.com/a/30530572/829571 – assylias Feb 08 '17 at 15:46
  • @assylias First is not a dupe, that doesn't cover toMap – Tim B Feb 08 '17 at 15:47
  • @TimB your question is on `collect` - toMap will just consume whatever collect passes. – assylias Feb 08 '17 at 15:48
  • @TimB so you want to have the "last value" in the map (whatever the list iteration provides)? – Eugene Feb 08 '17 at 15:49
  • Yes, for every element in the list that has the given key I need to take the last value from the list and drop the earlier elements that had that key. The code I've written does that according to a quick test case...but I wasn't sure how "safe" it was or if it might break in edge cases. – Tim B Feb 08 '17 at 15:53
  • @TimB, if you don't need the stream capability of processing data in arbitrary order, why use streams at all? Just use a simple loop the same way you intended. – M. Prokhorov Feb 08 '17 at 15:56
  • The first reason is simple: One line of code instead of many lines. The second is that when I have a tool I prefer to understand the safe limits of using it rather than not trying something because it might (or might not) break. I'd rather know definitively whether it is safe or not. – Tim B Feb 08 '17 at 15:57
  • @TimB this is what you are looking for: even if you do that in parallel, the encounter order is still guaranteed http://stackoverflow.com/questions/29709140/why-parallel-stream-get-collected-sequentially-in-java-8 – Eugene Feb 08 '17 at 16:17
  • @Eugene Yes, that sounds like it mostly covers it. "Encounter Order". Thanks, that seems like a useful keyword :) – Tim B Feb 08 '17 at 16:23
  • @Eugene, Do note that there is a clear distinction. Encounter order and temporal order are equal for non-parallel processing. However, in parallel processing encounter order and temporal order are not the same: for a data source with a defined encounter order, after the task is split between different threads, the encounter order within each chunk will be equal to the encounter order of non-parallel processing for the same chunk. However, the temporal order in parallel is not guaranteed in any way unless proper sync measures have been taken, as the chunks are processed in an unspecified order. – M. Prokhorov Feb 08 '17 at 16:53

2 Answers


The term “most recent value” is a bit misleading. Since you want the last value according to encounter order, the answer is that toMap will respect the encounter order.

Its documentation refers to Map.merge to explain the semantics of the merge function, but unfortunately, that documentation is a bit thin too. It doesn't explicitly state that the function is invoked with (oldValue, newValue); this can only be deduced from its code example.

toMap’s documentation further states:

The returned Collector is not concurrent. For parallel stream pipelines, the combiner function operates by merging the keys from one map into another, which can be an expensive operation. If it is not required that results are merged into the Map in encounter order, using toConcurrentMap(Function, Function, BinaryOperator, Supplier) may offer better parallel performance.

So it explicitly directs you to a different collector if encounter order is not required. Generally, the builtin collectors provided by Collectors are unordered only if explicitly stated, which is the case only for the “…Concurrent…” collectors and the toSet() collector.
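
To make that concrete, here is a minimal, self-contained sketch of the (oldValue, newValue) contract, using plain strings in place of the question's Update type:

import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class ToMapOrderDemo {
    public static void main(String[] args) {
        // Mirrors the question's a1, b1, a2 example: the key is the
        // first character, the value is the whole string.
        List<String> updates = Arrays.asList("a1", "b1", "a2");

        Map<Character, String> result = updates.stream()
            .collect(Collectors.toMap(
                s -> s.charAt(0),                   // key extractor
                s -> s,                             // value extractor
                (oldValue, newValue) -> newValue)); // called as (a1, a2) for key 'a'

        System.out.println(result); // {a=a2, b=b1}
    }
}

The merge function runs only for the colliding key a, and returning its second argument keeps the later element in encounter order.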

Holger

It is safe: Collection.stream() creates a sequential stream.

I suggest taking a look at Collectors.toMap: in case of collisions, it lets you choose which value to keep. In your case you should keep the more recent one.

The part you're interested in is (a, b) -> b, where you currently take the second element; that is where you should choose the more recent one.

I think your concern comes from not being sure about the processing order. If you want to continue using streams (instead of a for loop), you could make the intent explicit by adding .sequential() after .stream().

Another way, which I would prefer, is to add a timestamp to the RiskFactorChannelData; then you could even use a parallel stream.
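
A minimal sketch of that idea, assuming a hypothetical getTimestamp() accessor (returning a long) were added to RiskFactorChannelData; it is not part of the code in the question:

Map<RiskFactor, RiskFactorChannelData> updateMap =
    updates.parallelStream()
        .filter(this::updatedValueIsNotNull)
        .collect(Collectors.toMap(
            u -> u.getUpdatedValue().getKey(),
            Update::getUpdatedValue,
            // keep whichever value carries the newer (hypothetical) timestamp,
            // independent of encounter order
            (a, b) -> a.getTimestamp() >= b.getTimestamp() ? a : b));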

freedev
  • right, but the only way I know which value is more recent is by its position in the original array. – Tim B Feb 08 '17 at 15:48
  • @TimB streams that are built from an array are traversed in the order of the array; I think this is a good answer. – Eugene Feb 08 '17 at 15:51
  • @TimB, in that case, I suppose you cannot use streams for what you actually want. Imagine yourself putting `parallelStream` there, to speed it up or whatever - this means your stream becomes unordered. What I suggest you do is add something like a timestamp to your update to signify its actual order, so you don't rely on array position. – M. Prokhorov Feb 08 '17 at 15:52
  • @Eugene, yeah, it is *implied* that stream traversal will go in the order of the array. But it totally breaks if he goes parallel with it at some point in the future. And all in all, I see no benefit to using streams there if what he actually wants is a simple for-loop. – M. Prokhorov Feb 08 '17 at 15:54
  • @Eugene, I find that answer invalid in the case of the `toMap` collector. A parallel stream will split the computation into chunks, where the **order of items in each chunk** will be the same as in the original data source. However, the **order of chunks** is not guaranteed. What if the chunk split happens in between two updates with the same key, and the "second" chunk is concurrently processed **before** the "first" chunk? You'll have an immediate and hard-to-track bug. – M. Prokhorov Feb 08 '17 at 16:00
  • @M.Prokhorov totally my bad with the link... here it is: http://stackoverflow.com/questions/29709140/why-parallel-stream-get-collected-sequentially-in-java-8 – Eugene Feb 08 '17 at 16:16
  • @Eugene, your second link just reinforced my suspicion: the map collector does not enforce temporal order, so what I said about chunks being processed in a different order holds true. However, this discussion is kind of pointless since the question does not care about parallel. – M. Prokhorov Feb 08 '17 at 16:49
  • @M. Prokhorov: these chunks will populate different maps; that's why a collector has a `Supplier` rather than a single `Map` instance. The partial results are merged into a single map afterwards, considering the original encounter order of the chunks. – Holger Feb 08 '17 at 18:08
  • @Holger, will the merges be applied to the chunks in an order that preserves their position in the data source? – M. Prokhorov Feb 09 '17 at 05:50
  • @M. Prokhorov: yes, if neither the stream (or its underlying `Spliterator`) nor the `Collector` reports the `UNORDERED` characteristic, the merge operations are guaranteed to be performed in an order-preserving manner. This implies that the functions must fulfill the [Associativity](https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#Associativity) constraint. See also “[Mutable Reduction](https://docs.oracle.com/javase/8/docs/api/java/util/stream/package-summary.html#MutableReduction)”. – Holger Feb 09 '17 at 11:21