4

From what i understand, Set in java is an unordered collection and an iterator will process the items in some certain order of its choice(I might be wrong here) but ensures it process all the elements in the set.

In Java8, stream() API in Collections has been introduced with skip and limit functionality. So i want to know if the order of the items processed from stream remain same irrespective of how many times i start a stream or will it be random everytime ? Will the order change if the set gets modified in between the streams ?

May be irrelevant but i am providing the problem here : Now coming to the problem, I have a Set of size 2000 or something which wont be modified post creation and i am doing a batch operation for 50 involving a network call for each batch. I have a start param which gets incremented by 50 post every batch call. If i use a stream over my Set with "start" as the skip argument for every batch, it would be a new stream for every batch right ? So is the order of the stream gaurenteed to remain same. Obviously i dont wont the same entry multiple times and more importantly i dont wont to miss out on any entry. Simplest thing is for me to an Arraylist but i want to know if it is really required for me to create a set.

Biscuit Coder
  • 477
  • 4
  • 12
  • @holi-java To sum up my question, i want to understand if the order is preserved when i do a set.stream() operation multiple times. If the order changes frequently, i cant really rely on skip and limit methods in stream API to do a batch processing over all elements. – Biscuit Coder Jul 20 '17 at 14:20
  • 2
    You shouldn't! Maybe today you run some tests and find that order is preserved, but this is not guaranteed by the spec. If tomorrow the jdk team changes the implementation, your code might start to fail. If you want to guarantee orderness, you need a `List`, that's what they are meant for. – fps Jul 20 '17 at 14:25
  • 1
    @FedericoPeraltaSchaffner Agreed. Irrespective of the tests, its not something that is gaurenteed. So too risky to use it. I will be creating an ArrayList for the same. – Biscuit Coder Jul 20 '17 at 14:27
  • 1
    See [followup question](https://stackoverflow.com/q/45222328/1441122) and [my answer](https://stackoverflow.com/a/45231744/1441122). – Stuart Marks Jul 21 '17 at 07:39

2 Answers2

9

Let's start with an example here. First the obvious one I think:

List<String> wordList = Arrays.asList("just", "a", "test");

    Set<String> wordSet = new HashSet<>(wordList);

    System.out.println(wordSet);

    for (int i = 0; i < 100; i++) {
        wordSet.add("" + i);
    }

    for (int i = 0; i < 100; i++) {
        wordSet.remove("" + i);
    }

    System.out.println(wordSet);

Output will show a different "order" - because we have made the capacity bigger (via 1-100 addition) and entries have moved. They are still 3 there - but in different order (if such can be called order).

So, yes, once you modify your Set between stream operations, the "order" could change.

Since you say that post creation the Set will not be modified - the order is preserved at the moment, under the current implementation (whatever that is). Or more accurately it is not internally randomized - once entries are laid into the Set.

But this is absolutely something not to rely one - ever. Things can change without notice, since the contract is allowed to do that - the docs don't make any guarantees about any order what-so-ever - Sets are about uniqueness after all.

To give you an example the jdk-9 Immutable Set and Map do have an internal randomization and the "order" will change from run to run:

Set<String> set = Set.of("just", "a", "test");
System.out.println(set);

This is allowed to print:

 [a, test, just] or [a, just, test]

EDIT

Here is how the randomization pattern looks like:

/**
 * A "salt" value used for randomizing iteration order. This is initialized once
 * and stays constant for the lifetime of the JVM. It need not be truly random, but
 * it needs to vary sufficiently from one run to the next so that iteration order
 * will vary between JVM runs.
 */
static final int SALT;
static {
    long nt = System.nanoTime();
    SALT = (int)((nt >>> 32) ^ nt);
}

What this does:

take a long, XOR the first 32 bits with the last 32 bits and take the last 32 bits from that long (by casting to int). XOR is used because it has a 50% zeroes and ones distribution, so it does not alter the result.

How is that used(for a Set of two elements for example):

// based on SALT set the elements in a particular iteration "order"
if (SALT >= 0) {
   this.e0 = e0;
   this.e1 = e1;
} else {
   this.e0 = e1;
   this.e1 = e0;

My guess on the jdk9 internal randomization part, initially taken from here, the relevant part:

The final safety feature is the randomized iteration order of the immutable Set elements and Map keys. HashSet and HashMap iteration order has always been unspecified, but fairly stable, leading to code having inadvertent dependencies on that order. This causes things to break when the iteration order changes, which occasionally happens. The new Set/Map collections change their iteration order from run to run, hopefully flushing out order dependencies earlier in test or development

So it's basically to break all that code that would rely on order for a Set/Map. The same thing happened when people moved from java-7 to java-8 and were relying on HashMap's order (LinkedNodes), that was different due to TreeNodes introduction. If you leave a feature like that and people rely on it for years - it's hard to remove it and perform some optimizations - like HashMap moved to TreeNodes; because now you are forced to preserve that order, even if you don't want to. But that is just a guess obviously, treat it as such please

Eugene
  • 117,005
  • 15
  • 201
  • 306
  • Hi Eugene! Do you have details regarding jdk9 randomization of immutable sets/maps? Maybe a link to some post or just the class where this randomization is implemented. And more important, do you have any idea of why the jdk team did it this way? – fps Jul 20 '17 at 14:20
  • which means the random order occurs in different JVMs rather than in same JVM? – holi-java Jul 20 '17 at 14:26
  • @holi-java exactly – Eugene Jul 20 '17 at 14:26
  • 2
    @FedericoPeraltaSchaffner I edited the code to show how it is implemented. – Eugene Jul 20 '17 at 14:28
  • @Eugene Thank you very much for that snippet – fps Jul 20 '17 at 14:29
  • 2
    @FedericoPeraltaSchaffner the `whys` are not easy and honestly I can only think of the case to better accommodate the spec (that says no order is guaranteed)... – Eugene Jul 20 '17 at 14:29
  • 2
    @FedericoPeraltaSchaffner there's one more piece of code there that is really interesting, *at least* for me. `ImmutableCollections.SetN` - how the slot in the array is chosen... – Eugene Jul 20 '17 at 14:35
  • @Eugene If the spec doesn't guarantee order, it doesn't mean that it has to be randomized, otherwise the spec would say so. Anyways, I'm speculating here, maybe it's worth asking the question (if it hasn't been asked yet). Perhaps Stuart Marks or someone from the jdk team might shed some light here. Thank you all, guys! – fps Jul 20 '17 at 14:37
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/149731/discussion-between-federico-peralta-schaffner-and-eugene). – fps Jul 20 '17 at 14:37
  • 5
    @Eugene Your guess about the motivation for the randomization is correct. I remember reading that on the core-libs dev mailing list at that time. Even the OpenJDK test code had a lot of these dependencies and it was quite some work to iron this out. They wanted to avoid to fall in the same trap again. – Stefan Zobel Jul 20 '17 at 15:40
  • 4
    Great answer regarding the iteration order, but it’s also important to emphasize that `skip` and `limit` on a stream are not bound to the iteration order anyway, once the stream is unordered. – Holger Jul 20 '17 at 20:31
5

There are two aspects here. As Eugene correctly pointed out, you can’t assume that a HashSet’s iteration order stays the same—there is no such guaranty.

But the other aspect is the Stream implementation which is not required to maintain the iteration order when the Spliterator doesn’t report an ORDERED characteristic.

In other words, if a stream is unordered, skip(1) is not required to skip the first element, as there is no “first” element, but just to skip one element.

While streams are unlikely to implement randomization, they try to exploit characteristics to minimize the work. A plausible scenario would be that a Stream implementation will treat skip(n) for an unordered but SIZED source just like limit(size-n) as that will also effectively skip n elements, with less work.

Such optimization might not happen today, but in the next version, breaking your batch processing scenario, even in the case that the HashSet’s iteration order does not change.

Holger
  • 285,553
  • 42
  • 434
  • 765