Java streams: find first item different between two streams

Question

Assume I have two streams that have similar but not necessarily identical content. How do I find the first item that is different between the two. For instance, assuming I have:

Stream<String> first = Arrays.asList("a", "b", "c").stream();
Stream<String> second = Arrays.asList("a", "x", "c").stream ();

how do I determine that "b" and "x" are the first items at which the streams differ? And what about if one of the streams is truncated?

You'd likely wind up consuming an entire stream or merging them together to figure something like this out. Is there a specific reason why you have streams instead of a collection? That is to say, do you know the size of the data you're working with, and is it guaranteed to be a specific, certain size? — Makoto, Oct 12 '16 at 20:40
Streams can be ordered, but conceptually it's best to think of them as unordered. A lack of ordering allows for parallelization. The operation you want to perform is better suited for an `Iterator` or `Iterable` than a `Stream`. — John Kugelman, Oct 12 '16 at 20:41
I wanted to work with streams because I was planning on taking unordered data and using the stream sort operation to sort the data without it being inefficient when the data is already sorted (e.g. if it comes from a SortedSet). If there's another way of doing this that would work efficiently if the sources can be either `HashSet` or `TreeSet` instances, I'm happy to do it that way. — Periata Breatta, Oct 12 '16 at 20:43
(I'm making an assumption, I guess, that TreeSet.stream().sorted() is a no-op... I haven't checked that) — Periata Breatta, Oct 12 '16 at 20:45
It's not specified. But it's also not going to be any better or more efficient than dumping the elements into a `List` and calling `List.sort`. — Louis Wasserman, Oct 12 '16 at 20:46
@LouisWasserman - are you sure about that? I'm looking at the code [here](http://grepcode.com/file/repository.grepcode.com/java/root/jdk/openjdk/8-b132/java/util/stream/SortedOps.java#SortedOps.OfRef) and it certainly looks like if the source data set has the SORTED attribute (which is the case for SortedSet implementations) the operation simply works as a passthrough. Hold on, I'll just run a test on that... — Periata Breatta, Oct 12 '16 at 20:58
I said it's not specified, not that it's not implemented that way. But it sounds like your original data is unsorted, so the cost will be equal whether you go through a TreeSet or not. — Louis Wasserman, Oct 12 '16 at 20:58
Well, ok, but if it *is* implemented that way, I get the speed benefits from it. Whereas if it isn't, I still don't get them doing it any other way, unless I implement special cases for sorted sets, which will be a nightmare. — Periata Breatta, Oct 12 '16 at 20:59
My original data is supplied by end-users of a library I'm working on; I don't know whether it's sorted or not, so I need to support both cases. — Periata Breatta, Oct 12 '16 at 21:00
Out of interest, performing `s.stream().sorted().mapToInt(Object::hashCode).sum()` is around 10 times faster on a set of 1 million strings for `TreeSet` versus `HashSet` on my machine. Not perhaps as much difference as I expected, but it is clearly a saving. — Periata Breatta, Oct 12 '16 at 21:32
@PeriataBreatta the difference will probably be even more noticeable if you do a `findFirst()` instead of `mapToInt(…).sum()`. — Didier L, Oct 13 '16 at 16:22
@Didier L: in theory, a combination of `sorted` and `findFirst` could get converted to `min` like operation. So it’s reasonable to use a terminal operation that processes all elements, to be sure. It’s worth noting that the `sorted` optimization for sorted sources is quite fragile. It works only for the no-arg `sorted()` and a `null` comparator source. It’s not capable of recognizing that `Comparator.naturalOrder()` is equivalent to the `null` comparator nor when two custom comparators are the same… — Holger, Oct 14 '16 at 11:10
@Holger what's important is calling an operation that makes sure the sorting happens. Processing all elements adds a huge bias to the benchmark. Anyway, I doubt Periata followed [the rules for correct benchmarks](http://stackoverflow.com/questions/504103/how-do-i-write-a-correct-micro-benchmark-in-java) here ;-) — Didier L, Oct 14 '16 at 13:30

score 5 · Accepted Answer · answered Oct 12 '16 at 20:43

5

This is not a use case where streams can do anything useful for you. There is no built-in way to do this operation that is better than converting the streams to Iterators and comparing them in a more traditional style, or not using streams in the first place.

To do that more traditionally, with two Collections, I'd write

Iterator<String> itr1 = collection1.iterator();
Iterator<String> itr2 = collection2.iterator();
for (int i = 0; itr1.hasNext() && itr2.hasNext(); i++) {
  if (!itr1.next().equals(itr2.next())) {
    System.out.println(i);
    break;
  }
}

answered Oct 12 '16 at 20:43

Louis Wasserman

191,574
25
345
413

The streams *are* useful because they let me get a sorted data source without losing efficiency if the data source is already sorted, *but* you are right that performing the comparison using iterators is easier. Fortunately, I can get iterators out of the streams, so I can use the iterator-based code for consuming the data, but use the streams API to get the sorted iterators in the first place. – Periata Breatta Oct 12 '16 at 21:44

score 3 · Answer 2 · answered Oct 12 '16 at 21:02

A stream works on the inside of data, whereas an iterator works on the outside. With two walks (stream or iterator) one should at least have one on the outside.

As you have to repeat creating the second stream for every lambda of the first.

A full Stream solution is possible if the both streams maintain the same order. A bit like:

EqualTester tester = new EqualTester(
    () -> list1.stream().allMatch(x -> tester.queueAndWait(x)),
    () -> list2.stream().allMatch(x -> tester.queueAndWait(x))
);
boolean eq = tester.getFuture();

where you create threads of two Runnables, that place their next value in a "comparison input", and with both inputs true/false is yielded.

Feasible but a simple algorithm turned inside out. Maybe something for an interview question.

Java streams: find first item different between two streams

2 Answers2