
I recently had a coding test during an interview. I was told:

There is a large unsorted array of one million ints. User wants to retrieve K largest elements. What algorithm would you implement?

During this, I was strongly hinted that I needed to sort the array.

So, I suggested using the built-in sort() or maybe a custom implementation if performance really mattered. I was then told that by using a Collection or an array to store the k largest and looping over the input with a for-loop, it is possible to achieve approximately O(N). In hindsight, I think it's O(N*k), because each iteration needs to scan the K-sized array to find the smallest element to replace, whereas sorting the array would make the code at least O(N log N).
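
(For concreteness, a rough sketch of the for-loop method being described - a hypothetical helper, not the actual interview code - where each element is compared against the current smallest of the k candidates, which is what makes the pass O(N*k):)

public static int[] getKLargestForLoop(int[] arr, int k) {
    // Keep the k largest seen so far in a plain array.
    int[] top = new int[k];
    java.util.Arrays.fill(top, Integer.MIN_VALUE); // assumes the input has at least k elements
    for (int value : arr) {
        // Find the smallest of the current k candidates: this inner scan is the O(k) part.
        int minIndex = 0;
        for (int i = 1; i < k; i++) {
            if (top[i] < top[minIndex]) minIndex = i;
        }
        // Replace the smallest candidate if the new value is larger.
        if (value > top[minIndex]) top[minIndex] = value;
    }
    return top;
}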

I then reviewed this link on SO, which suggests a priority queue of K numbers, removing the smallest number every time a larger element is found, which would also give O(N log N): Write a program to find 100 largest numbers out of an array of 1 billion numbers

Is the for-loop method bad? How should I justify the pros/cons of using the for-loop versus the priority-queue/sorting methods? I'm thinking that if the array is already sorted, it could help by not needing to iterate through the whole array again, i.e. if some other retrieval method is later called on the sorted array, it should be constant time. Is there some performance factor when running the actual code that I didn't consider when theorizing about pseudocode?

Tan Yu Hau Sean
  • My first idea would indeed also be to iterate once over the array and keep track of the K largest elements, which is O(N). Since sorting is generally slower than that, I would say it's a pretty optimal solution. – Adriaan Koster Jul 19 '22 at 11:24
  • @AdriaanKoster what is O(N)? keeping track of the K largest elements is a little more complicated than tracking the single largest element. – qwr Jul 20 '22 at 18:29
  • @AdriaanKoster: For small K (much less than N), yeah one pass with a heap or sorted array of candidates is a good bet. You'll rarely see a new candidate greater than the current-Kth-largest seen (unless the array's initial order is trending towards increasing). And each new candidate only takes O(log K) time to insert into a heap or tree, or O(K) to insert into a sorted array. For small fixed K, O(N log K) as a worst case is basically O(N), and in practice fast. (And read-only on the original array.) – Peter Cordes Jul 20 '22 at 18:51
  • The problem is a matter of confusion around big O notation: O(n) == O(n * k) for constant k. Clearly, using a priority queue or a simple array-search are both O(n) for constant k: twice the elements will take about twice the time. However, since a priority queue requires O(n log k) operations, it is asymptotically faster for large k and n (but might actually be slower for small k). The confusion is that it is subjective whether k is a constant or not, and big-O notation only categorizes asymptotic behaviour, not absolute runtime. In practice, there are other "constants" too. – Remember Monica Jul 20 '22 at 21:22
  • I would use a binary tree to store the largest items so far, loop over the (unsorted) large array, and for each item compare it with the smallest element in the tree: ignore it if it's even smaller, and adjust the tree if it's not smaller. That would have a complexity of O(N log(k)). – Lajos Arpad Jul 23 '22 at 16:55

6 Answers

23

Another way of solving this is using Quickselect. This should give you a total average time complexity of O(n). Consider this:

  1. Find the kth largest number x using Quickselect (O(n))
  2. Iterate through the array again (or just through the right-side partition) (O(n)) and save all elements ≥ x
  3. Return your saved elements

(If there are repeated elements, you can avoid them by keeping count of how many duplicates of x you need to add to the result.)
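
A minimal sketch of those steps (hypothetical helper names; random pivot so the O(n) average case holds, as discussed in the comments below):

public static int[] getKLargestQuickselect(int[] arr, int k) {
    int[] a = arr.clone();                                   // keep the caller's array untouched
    int n = a.length;
    quickselect(a, 0, n - 1, n - k);                          // place the kth largest at index n-k
    return java.util.Arrays.copyOfRange(a, n - k, n);         // the right-side partition is the k largest
}

// Partially orders a[lo..hi] so that a[target] holds the element that would sit
// at index 'target' in fully sorted order; everything to its right is >= it.
private static void quickselect(int[] a, int lo, int hi, int target) {
    while (lo < hi) {
        int pivotIndex = java.util.concurrent.ThreadLocalRandom.current().nextInt(lo, hi + 1);
        int p = partition(a, lo, hi, pivotIndex);
        if (p == target) return;
        if (p < target) lo = p + 1; else hi = p - 1;
    }
}

// Lomuto partition around a[pivotIndex]; returns the pivot's final index.
private static int partition(int[] a, int lo, int hi, int pivotIndex) {
    int pivot = a[pivotIndex];
    swap(a, pivotIndex, hi);
    int store = lo;
    for (int i = lo; i < hi; i++) {
        if (a[i] < pivot) swap(a, i, store++);
    }
    swap(a, store, hi);
    return store;
}

private static void swap(int[] a, int i, int j) { int t = a[i]; a[i] = a[j]; a[j] = t; }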

The difference between your problem and the one in the SO question you linked to is that you have only one million elements, so they can definitely be kept in memory to allow normal use of Quickselect.

Berthur
  • @AlexanderIvanchenko Sure if you prefer. Was interesting to discuss with you though :) And you are right that quicksort is not typically used out-of-the-box in built-in solutions. – Berthur Jul 19 '22 at 22:39
  • I guess the most remarkable thing about built-in sorting algorithms in Java is that an array of `int` and a list of a wrapper type would be sorted using different algorithms. Quicksort would be used for primitives and Timsort for objects, because objects have identity and quicksort isn't considered suitable for them, since it might change the ordering of equal elements. – Alexander Ivanchenko Jul 19 '22 at 23:09
  • @AlexanderIvanchenko Yes, that's the topic of stable vs unstable sorting algorithms. Quicksort is not stable, as you say. Another remarkable thing to note is that although properly implemented, Quicksort's worst-case performance can be safely neglected in practice, there is room for some adversarial fun: https://www.cs.dartmouth.edu/~doug/mdmspe.pdf – Berthur Jul 20 '22 at 09:53
  • @Berthur Quicksort absolutely can be stable, just not the schoolbook in-place implementation. – orlp Jul 20 '22 at 18:16
  • It should be noted that quickselect has worst-case quadratic time complexity, like quicksort. And nothing in the problem rules out adversarial inputs. – qwr Jul 20 '22 at 18:42
  • @qwr It's an array of integers, all given at once, so there is no on-the-fly creation of adversarial inputs. So if you use a random pivot, adversarial inputs are ruled out. As for worst-case, if you pick the pivot randomly, the probability of getting quadratic time reduces exponentially with `n`. So with the input in the millions, you are more likely to randomly pick the same atom in the universe as me :) – Berthur Jul 20 '22 at 18:58
  • The worst case is more about a theoretical guarantee, though it is conceivable an attacker could find out the RNG seed (this is common in for example tool-assisted speedruns of video games). Interesting to note that heapselect does also work as an on-line algorithm. – qwr Jul 20 '22 at 19:11
  • As pointed out by an answer on [Write a program to find 100 largest numbers out of an array of 1 billion numbers](https://stackoverflow.com/a/19229007), QuickSelect leaves the K highest elements at one end of the array. You don't need to re-scan the whole array, just take that end of it! – Peter Cordes Jul 20 '22 at 20:59
  • @qwr Ah, attacking the RNG, did not think that far, good point. But yes, we are quite deep into theory here I guess :) I was mostly defending against the main misconception that the worst-case complexity of Quickselect makes it infeasible, or less attractive, in general. – Berthur Jul 20 '22 at 21:16
  • @PeterCordes Yes exactly, hence `...or just through the right-side partition` :) – Berthur Jul 20 '22 at 21:17
  • The O(n^2) worst-case runtime of naive quickselect can be avoided by use of [introselect](https://en.wikipedia.org/wiki/Introselect), which is basically "quickselect, but if we're not making progress fast enough, we turn on a pivot selection scheme with better worst-case behavior". – user2357112 Jul 20 '22 at 21:26
10

There is a large unsorted array of one million ints. The user wants to retrieve the K largest elements.

During this, I was strongly hinted that I needed to sort the array.

So, I suggested using a built-in sort() or maybe a custom implementation

That wasn't really a hint I guess, but rather a sort of trick to deceive you (to test how strong your knowledge is).

If you choose to approach the problem by sorting the whole source array using the built-in Dual-Pivot Quicksort, you can't obtain time complexity better than O(n log n).

Instead, we can maintain a PriorityQueue which stores the result. While iterating over the source array, for each element we check whether the queue has reached size K. If not, the element is added to the queue; otherwise (size equals K) we compare the element against the lowest element in the queue - if the new element is smaller or equal, we ignore it; if it is greater, the lowest element is removed and the new element is added.

The time complexity of this approach would be O(n log k), because adding a new element to (or removing the lowest element from) a PriorityQueue of size k costs O(log k), and in the worst-case scenario this operation is performed for each of the n elements of the array.

Note that the best case time complexity would be Ω(n), i.e. linear.

So the difference between sorting and using a PriorityQueue in terms of Big O boils down to the difference between O(n log n) and O(n log k). When k is much smaller than n, this approach gives a significant performance gain.

Here's an implementation:

import java.util.Arrays;
import java.util.Collection;
import java.util.PriorityQueue;
import java.util.Queue;

public static int[] getHighestK(int[] arr, int k) {
    Queue<Integer> queue = new PriorityQueue<>(); // min-heap: the head is the smallest of the k candidates
    
    for (int next : arr) {
        // If the queue is full and the new element beats the current minimum, evict the minimum.
        if (queue.size() == k && queue.peek() < next) queue.remove();
        // Add the element if there is room (either the queue wasn't full, or we just evicted).
        if (queue.size() < k) queue.add(next);
    }
    
    return toIntArray(queue);
}

public static int[] toIntArray(Collection<Integer> source) {
    return source.stream().mapToInt(Integer::intValue).toArray();
}

main()

public static void main(String[] args) {
    System.out.println(Arrays.toString(getHighestK(new int[]{3, -1, 3, 12, 7, 8, -5, 9, 27}, 3)));
}

Output:

[9, 12, 27]

Sorting in O(n)

We can achieve worst case time complexity of O(n) when there are some constraints regarding the contents of the given array. Let's say it contains only numbers in the range [-1000,1000] (sure, you haven't been told that, but it's always good to clarify the problem requirements during the interview).

In this case, we can use Counting sort which has linear time complexity. Or better, just build a histogram (first step of Counting Sort) and look at the highest-valued buckets until you've seen K counts. (i.e. don't actually expand back to a fully sorted array, just expand counts back into the top K sorted elements.) Creating a histogram is only efficient if the array of counts (possible input values) is smaller than the size of the input array.
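
A minimal sketch of that histogram idea (assuming, as above, that all values lie in [-1000, 1000]; the bounds are hypothetical, not part of the original problem):

public static int[] getHighestKByCounting(int[] arr, int k) {
    final int MIN = -1000, MAX = 1000;            // assumed value range
    int[] counts = new int[MAX - MIN + 1];        // one bucket per possible value
    for (int v : arr) counts[v - MIN]++;          // build the histogram in O(n)
    
    int[] result = new int[k];
    int idx = 0;
    // Walk the buckets from the highest value down, expanding only the top k counts.
    for (int bucket = counts.length - 1; bucket >= 0 && idx < k; bucket--) {
        for (int c = 0; c < counts[bucket] && idx < k; c++) {
            result[idx++] = bucket + MIN;
        }
    }
    return result;                                // the k largest, in descending order
}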

Another possibility is when the given array is partially sorted, consisting of several sorted chunks. In this case, we can use Timsort, which is good at finding sorted runs; it will deal with them in linear time.

And Timsort is already implemented in Java; it's used to sort objects (not primitives). So we can take advantage of the well-optimized and thoroughly tested implementation instead of writing our own, which is great. But since we are given an array of primitives, using the built-in Timsort has an additional cost - we need to copy the contents of the array into a list (or array) of a wrapper type.

Alexander Ivanchenko
  • Your claim that big-O is about worst case is misconception #4 in http://ssp.impulsetrain.com/big-o.html. Big-O is just about categorizing functions, and those functions can just as easily be about the best case or average case as well as the worst case. – btilly Jul 19 '22 at 14:38
  • @btilly It happens to be a widespread misconception... Thanks for pointing at my delusion. Fixed the answer. – Alexander Ivanchenko Jul 19 '22 at 23:29
  • The time complexities are specifically for a priority queue implemented as binary heap (which I assume is true in most languages, but maybe not all). – qwr Jul 20 '22 at 18:25
  • A question/nitpick: "If you choose to approach the problem by sorting the whole source array, you can't obtain time complexity better than O(n log n)." this is only true if you use *comparison* sorts, correct? Since the domain is ints, which are technically enumerated, you could use non-comparison sorts in O(n) time, as other answers do. Not a good idea in practice, but for a puzzle would probably get you there – en_Knight Jul 20 '22 at 19:27
  • @qwr I'm not sure that I've got your comment correctly. For most of the implementations and specifically for the `PriorityQueue` from the JDK *enqueuing* operation (call `add`) has a logarithmic cost as I've pointed. And removal of the lowest element (in `PriorityQueue` representing the Min Heap) would happen in **O(1)**. – Alexander Ivanchenko Jul 20 '22 at 19:47
  • My comment is just a small note on explicitly how the listed time complexities are for a given implementation as a binary heap. Priority queue is the abstract data structure and binary heap is the implementation. – qwr Jul 20 '22 at 20:09
  • Also delete-min has a worst case of O(log k) since the operation has to "sift-down" the new element at the root of the binary heap. – qwr Jul 20 '22 at 20:11
  • @en_Knight: Yes, CountingSort is O(n), and could actually be good if your integers are all in a small range (e.g. 16-bit) so you can use an array of counts instead of a giant hash table or other dictionary, or a 16 GiB array of 2^32 x 4-byte counters. Even better, instead of expanding the histogram back to a sorted array, just use the histogram to answer order-statistic queries. See my edit to the community-wiki top answer on [Write a program to find 100 largest numbers out of an array of 1 billion numbers](https://stackoverflow.com/a/19227860) – Peter Cordes Jul 20 '22 at 20:13
  • @en_Knight The phrase you've quoted was referring to OP's intention to use built-in sorting (I guess it's not obvious from the text, I'll change it). With *DualPivot Quicksort* in Java you can't get time complexity better than *linear-logarithmic*. I've also mentioned the case when we can apply *Counting sort* and achieve the time complexity of **O(n)** (see at the very end of the answer). – Alexander Ivanchenko Jul 20 '22 at 20:23
  • @en_Knight `in O(n) time, as other answers do` - You're referring to this [*answer*](https://stackoverflow.com/a/73037118/17949945) - well, at the very beginning it was completely impractical (since it suggests creating an array of size `Integer.MAX_VALUE`); a few hours ago the author added a constraint assuming that the input is within the range `[-1000, 1000]`, which **suspiciously resembles** the arbitrary numbers I've used in my answer **yesterday** (check the history of editing for both answers). I hope I've addressed your nitpick? – Alexander Ivanchenko Jul 20 '22 at 20:23
  • @qwr Sorry, but you're not reading my comments carefully. I'm **not** talking *abstract data structures*. [`PriorityQueue`](https://docs.oracle.com/en/java/javase/17/docs/api/java.base/java/util/PriorityQueue.html) is a concrete implementation from the JDK. Removing the root (see methods `remove()`/`poll()`) costs **O(1)** - no one can beat the documentation. – Alexander Ivanchenko Jul 20 '22 at 20:46
  • @PeterCordes One interesting remark: built-in *DualPivot Quicksort* will perform *Counting sort* on an array of `short`, `char` and `byte` if its length is greater than `1750`. – Alexander Ivanchenko Jul 20 '22 at 20:59
  • @en_Knight Besides the fact that counting sort will **not** give any performance gain when an array of size `Integer.MAX_VALUE` is used to store frequencies, this approach is not even *viable*: it would give `OutOfMemoryError` without changing the heap size - [*Proof*](https://www.jdoodle.com/ia/tph). And obviously the whole range of int values can't fit into `Integer.MAX_VALUE`. Applying *Counting sort* on an `int` array simply doesn't make sense when there are no constraints. – Alexander Ivanchenko Jul 20 '22 at 21:02
  • If you were going to histogram arbitrary 32-bit Java `int`s into an array of `int` counters, you'd need one array for negative integers, another for non-negative ones (8 GiB each). Once you've seen K non-negative integers, you can stop counting negative numbers and empty / delete that array, since you know the Kth largest numbers won't include any of those. (e.g. jump to another version of the loop, or just keep doing an if inside it). But worst-case you're still using all 16 GiB of heap memory the whole time. – Peter Cordes Jul 20 '22 at 21:07
  • sounds good, I agree there's no viable approach with O(n) sorting on ints so it's really just a semantic quibble with the quoted sentence (in an interview context like this question, it's sometimes worth *mentioning* that the oft-quoted O(n log(n)) lower bound can be circumvented if you can exploit structure, but I agree with you: not really viable here). I like the answer! – en_Knight Jul 20 '22 at 21:11
  • Even in C on a 64-bit machine (where you can easily have a `uint32_t counts[0x100000000] = {0};` (i.e. 2^32 x 4-byte elements), it would likely perform badly. Those scattered increments would often miss in the TLB and cache. Especially with medium-sized problems like N = 1 million, just zeroing a count array 4096 times that large is very costly! So yeah, crazy. And not good even for much larger arrays of arbitrary `int` – Peter Cordes Jul 20 '22 at 21:11
  • @PeterCordes `If you were going to histogram arbitrary 32-bit Java int` - No-no-no, I didn't say this either in the answer or in the comments. This crazy idea (sorry, I don't want to offend anyone) is represented in [*this answer*](https://stackoverflow.com/a/73037118/17949945), to which @en_Knight probably refers. – Alexander Ivanchenko Jul 20 '22 at 21:11
  • If it isn't clear yet, +1 from me :) – en_Knight Jul 20 '22 at 21:17
  • @en_Knight: Radix Sort or Bucket Sort are potentially possible for medium-sized N like this. Those have the advantage that (like QuickSelect vs. QuickSort), you can narrow down to just the buckets (`ArrayList`s) containing the K highest numbers. e.g. if you were distributing into 256 ArrayLists based on the leading 8 bits of the `int` value, and there are more than K elements in the top bucket, you can discard all the other buckets. Or if there are fewer than K, you know they're *all* part of the K highest, and can find the cutoff in the next bucket down. But the const factor is high. – Peter Cordes Jul 20 '22 at 21:22
  • @en_Knight Thanks for your approval. Since we are having a conversation, there's one more **O(n)** case which no one has mentioned - when we are sorting an array that has already been sorted (or consists of a couple of sorted chunks) using *Timsort* (probably because OP said that the array is unsorted, but an interview is not a practical task, it's about demonstrating knowledge). In Java, built-in *Timsort* would be used to sort a collection of a wrapper type. – Alexander Ivanchenko Jul 20 '22 at 21:28
  • @PeterCordes (1) i like your profile picture, seems appropriate haha (2) ya, imo a radixsort-like approach is helpful to have in your head (though not to implement), so that in an interview you can say "well the *theoretical* complexity is O(n) because I can exploit structure in this simple but impractical way: here's my practical solution that has slightly worse time complexity". Similar to Alexander's other comments about ways you could improve by imposing additional constraints; they would show to an interviewer a stronger fundamental command yknow – en_Knight Jul 20 '22 at 21:33
  • @AlexanderIvanchenko I'm not familiar with the JDK docs but the link you gave me at the top of the page says poll (which I understand as pop or delete-min) is log(n) time? – qwr Jul 20 '22 at 21:55
  • @qwr Sorry, my bad.. I've messed things up a bit. Accessing the *root* via `peek()`, which also takes place in my implementation, has a cost of **O(1)**. But removing the *root* would be done in logarithmic time. You were correct. – Alexander Ivanchenko Jul 20 '22 at 22:03
  • in practice O(n) with high constant factors may be slower than O(n log n), since log n is so small. There is a funny notation I've encountered that is written as Õ(n) which just ignores log factors. – qwr Jul 20 '22 at 22:08
6

This is a classic problem that can be solved with so-called heapselect, a simple variation on heapsort. It can also be solved with quickselect, but, like quicksort, quickselect has a poor quadratic worst-case time complexity.

Simply keep a priority queue, implemented as a binary heap, of size k. Walk through the array and insert values into the heap (worst case O(log k)). When the priority queue grows beyond size k, delete the minimum value at the root (worst case O(log k)). After going through the n array elements, you have removed the n-k smallest elements, so the k largest elements remain. It's easy to see the worst-case time complexity is O(n log k), which is faster than O(n log n), at the cost of only O(k) space for the heap.

qwr
  • Heapselect just being: in-place heapify the array (O(N) average time), then extract K elements in `O(K * log(N))` time, right? [How can building a heap be O(n) time complexity?](https://stackoverflow.com/q/9755721) – Peter Cordes Jul 20 '22 at 20:24
  • @PeterCordes oh, that's not the algorithm I was proposing (I'm not sure if heapselect is the right name, but it is just a variation of heapsort). Idk how to build an n-sized heap in O(n), and even then my algorithm has all K elements already in the heap (but not sorted) – qwr Jul 20 '22 at 21:50
  • I was guessing at how HeapSelect might work, since I wasn't familiar with it. Not your 2nd paragraph, which seems to be discussing the standard one-pass algorithm with a priority queue. Hmm, from googling, it seems HeapSelect is about producing just the Kth element, not the whole set of elements. Also related: https://en.wikipedia.org/wiki/Selection_algorithm#Language_support – Peter Cordes Jul 20 '22 at 21:59
  • I was planning on contributing a page to cp-algorithms.com on this k-largest numbers problem, where many of these kinds of algorithms for competitive programming are detailed (the site started as a translation of e-maxx.ru) but it's in my backlog – qwr Jul 20 '22 at 22:02
4

Here is one idea: create an int array with the maximum possible size (2147483647, the maximum value of int). Then, for every number read from the original array in a for-each loop, increment the element at the index equal to that number in the newly created array.

So at the end of this for-each I will have something like [1, 0, 2, 0, 3] (the array I created), which represents the numbers [0, 2, 2, 4, 4, 4] (the initial array).

To find the K largest elements, iterate backwards over the created array and count down from K to 0 every time you find an element that is not 0. If, for example, the count is 2, you have to emit that number 2 times.

The limitation of this approach is that it works only with integers, because of the nature of the array...

Also, the range of int in Java is -2147483648 to 2147483647, which means that only the non-negative numbers can be placed (counted) in the created array.

NOTE: if you know the maximum possible value in the input, you can lower the created array's size to that maximum. For example, if the max value is 1000, then the array you need to create only has size 1001 (one slot per value from 0 to 1000), and this algorithm should perform very fast.

Level_Up
  • isn't this similar to, I think it was, counting or radix sort? – Tan Yu Hau Sean Jul 19 '22 at 12:47
  • Yes, exactly. This is the idea. – Level_Up Jul 19 '22 at 12:49
  • @TanYuHauSean: Yes, this is the histogram part of CountingSort; you just use the histogram directly to answer the queries instead of expanding back into an array. You'd either need 2 arrays of counts (for positive or negative), or if Java can use `long` to index arrays, use `2147483648 + (long)input[i]` as the index into a 16GiB array of 2^32 `int` counters. Or once you've seen 100 non-negative numbers, you can skip counting any more negative numbers and delete that array. See also [this answer](https://stackoverflow.com/a/19227860/224132) – Peter Cordes Jul 20 '22 at 20:20
  • Histogramming arbitrary 32-bit `int` is not worth it. Note that zeroing an array of counts would need to write 16 GiB of memory, but the input is only 1 million ints (4 MiB). So yeah, **only worth considering when the range is limited, so the count array can be significantly smaller than the input size.** – Peter Cordes Jul 20 '22 at 21:15
3

I think you misunderstood what you needed to sort.

You need to keep the K-sized list sorted; you don't need to sort the original N-sized input array. That way the time complexity would be O(N * log(K)) in the worst case (assuming you need to update the K-sized list almost every time).

The requirements said that N was very large, but K is much smaller, so O(N * log(K)) is also smaller than O(N * log(N)).

You only need to update the K-sized list for each record that is larger than the K-th largest element before it. For a randomly distributed list with N much larger than K, that will be negligible, so the time complexity will be closer to O(N).

For the K-sized list, you can take a look at the implementation in Is there a PriorityQueue implementation with fixed capacity and custom comparator?, which uses a PriorityQueue with some additional logic around it.
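
For small K, a plain sorted array also works; here is a minimal sketch of that idea (hypothetical helper, kept in ascending order so the current K-th largest sits at index 0):

public static int[] getKLargestSortedList(int[] arr, int k) {
    int[] top = new int[k];   // ascending; top[0] is the smallest of the current candidates
    int size = 0;
    for (int v : arr) {
        if (size < k) {
            // List not full yet: insert v at its sorted position.
            int i = size++;
            while (i > 0 && top[i - 1] > v) { top[i] = top[i - 1]; i--; }
            top[i] = v;
        } else if (v > top[0]) {
            // v beats the current K-th largest: drop top[0], shift, and insert v.
            int i = 1;
            while (i < k && top[i] < v) { top[i - 1] = top[i]; i++; }
            top[i - 1] = v;
        }
    }
    return top; // the K largest, in ascending order
}

Each replacement is O(K) in the worst case, but as noted in the comments below, for randomly distributed input most elements fail the `v > top[0]` check and cost only the comparison.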

GeertPt
  • A *sorted* K-sized list would take O(K) time per insertion to maintain. As you say, normally you'd use a PriorityQueue, which may use [a heap data structure](https://en.wikipedia.org/wiki/Heap_(data_structure)#Comparison_of_theoretic_bounds_for_variants), so you can pull out the smallest and insert the new in O(log K) time. For very small K, the simplicity of a sorted array can be a win. But a heap isn't a "sorted list"; you can't traverse it in-order in O(K) time. – Peter Cordes Jul 20 '22 at 18:58
  • Most of the time you only need access to listK.last(), which should be accessible in O(1). You only need to insert and replace an item if listK is not full yet, or if the item is larger than listK.last(). If N is much larger than K, and randomly distributed, the number of insertions is probably negligible (can't do the math right now). – GeertPt Jul 22 '22 at 08:55
  • Fair point about insertions probably tending to not go very high into the array, so you might on average not need to copy O(K) elements on a typical insertion. But that's probably only true with uniformly distributed elements in your big array. If it's monotonically increasing, you're always seeing a new max bigger than all K elements. That's also the worst case for a priority queue, but O(log K) replacement makes it less bad, unless you maybe have adversarial inputs that find the worst case for a heap. (For small K on real CPUs with wide SIMD, array insertion work is just a fast memmove..) – Peter Cordes Jul 22 '22 at 15:48
  • The "not full yet" case is barely relevant: you start by sorting the first K elements of the big array to populate your initial candidate list; it makes sense to do that as a separate step, so you aren't checking for "not full" every time through the main loop for a million elements. And so you can use an O(k log k) sort, instead of effectively InsertionSort for the first 100 elements. – Peter Cordes Jul 22 '22 at 15:52
3

There is an algorithm to do this in worst-case time complexity O(n*log(k)) with very benign constant factors (since there is just one pass through the original array, and the inner part that contributes to the log(k) is only reached relatively seldom if the input data is well-behaved).

  • Initialize a priority queue implemented with a binary heap A of maximum size k (internally using an array for storage). In the worst case, this has O(log(k)) for inserting, deleting and searching/manipulating the minimum element (in fact, retrieving the minimum is O(1)).
  • Iterate through the original unsorted array, and for each value v:
    • If A is not yet full then
      • insert v into A,
    • else, if v>min(A) then (*)
      • insert v into A,
      • remove the lowest value from A.

(*) Note that A can return repeated values if some of the highest k values occur repeatedly in the source set. You can avoid that with a search operation to make sure that v is not yet in A. You'd also want to find a suitable data structure for that (as the priority queue has linear search complexity), i.e. a secondary hash table or balanced binary search tree or something like that, both of which are available in java.util.

The java.util.PriorityQueue helpfully guarantees the time complexity of its operations:

this implementation provides O(log(n)) time for the enqueuing and dequeuing methods (offer, poll, remove() and add); linear time for the remove(Object) and contains(Object) methods; and constant time for the retrieval methods (peek, element, and size).

Note that as laid out above, we only ever remove the lowest (first) element from A, so we enjoy the O(log(k)) for that. If you want to avoid duplicates as mentioned above, then you also need to search for any new value added to it (with O(k)), which opens you up to a worst-case overall scenario of O(n*k) instead of O(n*log(k)) in case of a pre-sorted input array, where every single element v causes the inner loop to fire.
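
A sketch of the duplicate-avoiding variant, using a secondary HashSet (one of the java.util options mentioned above) so the membership check is O(1) on average rather than O(k) on the queue itself; the helper name and the choice of HashSet are assumptions for illustration:

import java.util.HashSet;
import java.util.PriorityQueue;
import java.util.Set;

public static int[] topKDistinct(int[] arr, int k) {
    PriorityQueue<Integer> heap = new PriorityQueue<>(); // A: min-heap, root is the lowest candidate
    Set<Integer> inHeap = new HashSet<>();               // mirrors A's contents for fast membership tests
    
    for (int v : arr) {
        if (inHeap.contains(v)) continue;                // skip values already among the candidates
        if (heap.size() < k) {                           // A is not yet full
            heap.add(v);
            inHeap.add(v);
        } else if (v > heap.peek()) {                    // v > min(A)
            inHeap.remove(heap.poll());                  // remove the lowest value from A
            heap.add(v);
            inHeap.add(v);
        }
    }
    return heap.stream().mapToInt(Integer::intValue).toArray();
}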

AnoE
  • A priority queue implemented with a binary heap has guaranteed worst case inserts and delete-min of O(log n). I believe this is the same as the self-balancing BST, but a little less complicated in its heap operations. – qwr Jul 20 '22 at 18:38
  • The original question didn't mention duplicates, but if you want to handle those, a self-balancing BST gives you search in log time always, so it would be better in worst-case time complexity – qwr Jul 22 '22 at 22:16
  • Checking duplicates in a hash table has better average-case performance (constant) but worse worst-case performance (linear) – qwr Jul 22 '22 at 22:23
  • Yeah, I was pondering whether I should say more about that aspect, but frankly I think (especially as it's not mentioned in the question) I'll leave that up to the reader (but have mentioned your points briefly). – AnoE Jul 25 '22 at 09:16