What's the time complexity of Java Stream distinct().sorted()?

Question

Every time when I get a coding interview, I always avoid using Java stream, because I can't analyze the time complexity very well.

For example: in my daily work, I might write like this:

Arrays.stream(a).distinct().sorted().toArray();

to get the unique number and sort them.

but I'm curious about the time complexity will be..? is distinct().sorted will become an nested loop?

should I need to seperate them?

int[] arr = Arrays.stream(a).distinct().toArray();
Arrays.stream(arr).sorted().toArray();

so sometimes when I have an interview, I'll use set to distinct then sort them.... but I really want to write a clean code...

If anyone can help! Thank you!

Sorting complexity is probably O(n*log(n)) (depends on the Java version you're using). Finding unique numbers is O(n) in worst-case (if all numbers are different). — Maroun, Mar 25 '21 at 11:22
`distinct` can be described as a *stateful* filter. Its time complexity is *probably* *O(n)*. Each element must be examined if it has already been seen. Internally, this would probably be implemented with a `Set`, which has a time complexity of *O(1)*. — MC Emperor, Mar 25 '21 at 11:28
*"... a coding interview, I always avoid using Java stream"* - this is very wrong. — Nikolas Charalambidis, Mar 25 '21 at 11:39
Ignore parallel streams, for `distinct()` it uses a `HashSet` for unsorted streams (_O(n)_ amortized) and simply compares adjacent values for sorted streams (_O(n)_). For `sorted()`, performance is generally _O(n*log(n))_. Which means that the overall performance is **_O(n*log(n))_**. — Andreas, Mar 25 '21 at 11:42
"*when I get a coding interview, I always avoid using Java stream, because I can't analyze the time complexity very well*" As an interviewer, I look for people who can write readable code above almost all else. If something is inefficient, and I think that inefficiency matters, I can decide to ask the candidate whether there's any other approach they could have taken, and why they leaned towards the one they chose. I think as a general rule it's a bad idea to try to alter your behaviour to what you *think* they want. Just do what comes naturally. — Michael, Mar 25 '21 at 11:51
I don't know! Because last time, an interviewer told me, the time complexity of distinct().sorted() will be O(n)*O(nlogn) = O(n^2logn), that's why I so confused... Now I think he is wrong, if this is a nested loop, the logic will be wrong! The distinct() should be done before sorted()? right~? — Timmy Lin, Mar 25 '21 at 13:05
That is "kind-of" correct, but confusing. It depends on what exactly you mean by "distinct() should be done before sorted()". To be precise, `distinct()` takes a stream and returns another stream. The stream returned by `distinct()` is one which will cause every element retrieved from it to be filtered for distinctness. The retrieval of these elements will be done by the `sort()` method. In any case, the `.distinct().sorted()` time complexity will be O(n log n). Now here's a question for you to consider: which is fastest: `.distinct().sorted()` or `.sorted().distinct()`? — k314159, Mar 25 '21 at 14:39
If your interviewer said it's O(n^2 log n) then your interviewer was wrong and probably shouldn't be interviewing people. — kaya3, Mar 25 '21 at 17:14

Holger · Accepted Answer · 2021-03-25T17:23:16.980

There is no definite statement possible, as you are developing against an interface and the specification does not mandate a particular sort algorithm.

For the generic Stream, we have to assume a comparison sort whose worst case of O(n log n) can not be avoided by any algorithm.

But your example uses an IntStream which, in principle, could use a counting sort or similar, having O(n). This doesn’t happen in practice with the reference implementation, as having a better worst case time complexity does not necessarily lead to a better performance in practice and the maximum number of elements is limited to the JVM’s maximum array size.

The time complexity of distinct() is O(n), as it just checks whether adding into a HashSet succeeds. Combining the O(n) with O(n log n) leads to an overall complexity of O(n log n). Maybe the interviewer got this combining of time complexities wrong.

But this is a good example demonstrating that time complexity is not the equal to performance. When you use sorted().distinct(), the distinct() operation will utilize the sorted nature of the incoming elements, which makes it unnecessary to build a HashSet behind the scenes¹. Since the reference implementation has no primitive value set, this eliminates a lot of boxing overhead. On the other hand, using distinct().sorted() could reduce the number of elements to sort, but it requires to have significantly less distinct elements than total stream elements to pay off.

Such performance differences are not covered by the time complexity, which still is the same for both approaches. But as said, for the streams of primitive types, different algorithms with a different time complexity would be possible.

But one thing, we can say for sure. When you split the operations into two stream operations, like requesting a result array from the first and passing it to Arrays.stream again, there is no chance that the underlying implementation utilizes knowledge about the previous operation in the next operation.

Note that the statements above assume a terminal operation like your example’s toArray that consumes all elements and requires the resulting encounter order to be maintained. With other, short-circuiting or unordered terminal operations, the overall time complexity could change, e.g. sorted().findFirst() could get optimized to the equivalent of min() or the sorting step could get eliminated for an unordered terminal operation, like sum(). That does not happen in the current reference implementation, but, as said, you’re programming against an interface.

¹ for primitive streams, this only works in Java 9+. As said, there is no primitive specialization for distinct(), it’s implemented like boxed().distinct().mapToInt(i -> i) and in Java 8, boxed() is implemented as mapToObj(Integer::valueOf) which loses the information about the sorted input, as explained in the last section of this answer.

Thank you for your answers! You're so professional! I think " there is no chance that the underlying implementation utilizes knowledge about the previous operation in the next operation." is helpful! As you say, "the overall time complexity could change, e.g. sorted().findFirst() could get optimized to the equivalent of min()", so it seems using stream has its advantages. Thank you! — Timmy Lin, Mar 26 '21 at 09:09
As you say...I think the interviewer mistakes the time complexity but according to the stream explanation, stream() will depend on the previous operation, it sounds like a nested operation, so I think that's why the interviewer consider the time complexity should O(n*nlogn) — Timmy Lin, Mar 26 '21 at 09:22

What's the time complexity of Java Stream distinct().sorted()?

1 Answers1