1

I would like to keep track of the kth largest element (k is fixed) of an array that is continuously growing or reducing in size. Which data structure can I use to achieve the best time complexity?

When the array is only growing (inputs are coming in a stream), a min-heap achieves the optimal complexity.

When k is not fixed, a solution that uses a min-heap and a max-heap can be found here: Find running median from a stream of integers

Is the use of two heaps also optimal when k is fixed?

2 Answers2

1

If array constantly grows you would add an element to minHeap and at the time it reaches k + 1, you delete the top, so you have first k elements on the heap with minimum at the root.

I believe you could achieve the same if you use a maxHeap. You will have same top k elements after you delete the top when you reach k + 1 elements but the maximum will be at the root instead.

Now when array decreases, at each deletion look if that element being removed is greater than maxHeap's peek and that array size is still greater than or equal to k, then you don't need to do anything.

If the array size decreases below k then kth largest would become invalid and then you keep removing the maxHeap top one by one as the array decreases.

Here's the tough part:

But when the element being removed is lesser than maxHeap's peek and array size is still >= k add the element to maxHeap, if that element already exists in the heap, then there is no change in size which indicates that the element is there in the heap. Remove that element and add the last element in the array.

SomeDude
  • 13,876
  • 5
  • 21
  • 44
  • I'm not sure I follow the "tough part". Could you please clarify? Wouldn't I have to keep another heap (a min heap) to store the elements that are smaller than the smallest element in the maxHeap, so that I can move the largest element in the min heap to the max heap whenever I delete an element that is smaller than the maxHeap's peek? – OtherwiseWorth Aug 13 '18 at 23:51
  • "to store the elements that are smaller than the smallest element in the maxHeap" - if there were any elements that are smaller than the smallest element in maxHeap those would have been already captured in maxHeap, because maxHeap's capacity is capped at K. I don't understand rationale of using 2 heaps here. In running median case it makes sense because median essentially divides 2 groups of numbers into smaller and greater than median. In this case all you want to do is to make sure you capture first 'K' elements in a given array. – SomeDude Aug 14 '18 at 11:56
1

2 heaps is good as long as the desired rank changes by at most a small constant for each insert or removal, because then it only takes O(log N) work to rebalance the heaps after every insertion or removal. So fixed k, running median, running 10th percentile, etc. are all fine.

If k changes by more than that, or to allow any rank to be retrieved at any time, use an order statistic tree: https://en.wikipedia.org/wiki/Order_statistic_tree

Matt Timmermans
  • 53,709
  • 3
  • 46
  • 87