
I'm faced with a problem that requires a queue data structure supporting fast k-th largest element queries.

The requirements of this data structure are as follows:

  1. The elements in the queue are not necessarily integers, but they must be comparable to each other, i.e., given any two elements we can tell which one is greater (they may also be equal).

  2. The data structure must support enqueue (add an element at the tail) and dequeue (remove the element at the head).

  3. It can quickly find the k-th largest element in the queue; please note that k is not a constant.

  4. You can assume that the operations enqueue, dequeue, and k-th largest element finding all occur with the same frequency.


My idea is to use a modified balanced binary search tree. The tree is an ordinary balanced binary search tree, except that every node node_i is augmented with an extra field n_i, where n_i denotes the number of nodes contained in the subtree rooted at node_i. The aforementioned operations are supported as follows:

For simplicity assume that all elements are distinct.

Enqueue(x): x is first inserted into the tree; suppose the corresponding node is node_t. We then append the pair (x, pointer to node_t) to the queue.

Dequeue: suppose (e_1, node_1) is the element at the head, where node_1 is the pointer to the tree node corresponding to e_1. We delete node_1 from the tree and remove (e_1, node_1) from the queue.

K-th largest element finding: suppose the root is node_root with children node_left and node_right (assume both exist), and let n_root, n_left, n_right be their subtree sizes. Taking the left subtree to hold the larger elements (the mirror-image orientation is symmetric), we compare K against these sizes; three cases may happen:

  1. if K <= n_left, we find the K-th largest element in the left subtree of node_root;

  2. if K > n_root - n_right (that is, K > n_left + 1, since n_root = n_left + n_right + 1), we find the (K - n_root + n_right)-th largest element in the right subtree of node_root;

  3. otherwise K = n_left + 1 and node_root is the node we want.

The time complexity of all three operations is O(log N), where N is the number of elements currently in the queue.
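As a concrete sketch of the approach above (assuming distinct elements, as stated): the Python below uses an *unbalanced* size-augmented BST for brevity and deletes by key rather than via a stored node pointer; all names are illustrative. A real implementation would use a balanced tree (AVL, red-black, treap) to actually guarantee the O(log N) bounds. Keys are stored in ascending order, so the k-th largest descends through the right subtree, the mirror image of the cases in the text.

```python
import collections

class Node:
    def __init__(self, key):
        self.key = key
        self.left = None
        self.right = None
        self.size = 1  # number of nodes in this subtree

def size(node):
    return node.size if node else 0

def update(node):
    node.size = 1 + size(node.left) + size(node.right)

def insert(node, key):
    if node is None:
        return Node(key)
    if key < node.key:
        node.left = insert(node.left, key)
    else:
        node.right = insert(node.right, key)
    update(node)
    return node

def delete(node, key):
    if node is None:
        return None
    if key < node.key:
        node.left = delete(node.left, key)
    elif key > node.key:
        node.right = delete(node.right, key)
    else:
        if node.left is None:
            return node.right
        if node.right is None:
            return node.left
        # two children: replace with the successor (min of right subtree)
        succ = node.right
        while succ.left:
            succ = succ.left
        node.key = succ.key
        node.right = delete(node.right, succ.key)
    update(node)
    return node

def kth_largest(node, k):
    # rank selection using subtree sizes, counting from the largest side
    r = size(node.right)
    if k <= r:
        return kth_largest(node.right, k)
    if k == r + 1:
        return node.key
    return kth_largest(node.left, k - r - 1)

class KthQueue:
    def __init__(self):
        self.root = None
        self.fifo = collections.deque()
    def enqueue(self, x):
        self.root = insert(self.root, x)
        self.fifo.append(x)
    def dequeue(self):
        x = self.fifo.popleft()
        self.root = delete(self.root, x)
        return x
    def kth_largest(self, k):
        return kth_largest(self.root, k)
```

The `kth_largest` function follows the three cases from the text exactly, with the size comparisons taken against the right subtree because larger keys sit to the right here.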

How can I speed up the operations mentioned above? With what data structures and how?

outlaw
  • This seems like a contrived container (finding the kth element but then not removing it, all operations equal frequency) - can you provide more information about the underlying problem you're trying to solve? If you really need all these requirements I think you already found a solid solution. – Mark B Sep 21 '12 at 14:22
  • @Mark B, I am processing a data stream, and I want to know the current k-th largest element in the sliding window. Actually I want to find the top-k elements. – outlaw Sep 21 '12 at 14:27
  • This might be a bit of a read but [boost::multiIndex](http://www.boost.org/doc/libs/1_51_0/libs/multi_index/doc/index.html) could help with this problem. – andre Sep 21 '12 at 14:27
  • Regarding requirement 2, do the elements need to be accessible in the order in which they were added or can they be sorted? – smocking Sep 21 '12 at 14:32
  • @smocking, enqueue and dequeue operations must be in order (FIFO); no other constraints are imposed on element access. – outlaw Sep 21 '12 at 14:35
  • @outlaw, Thx. Think I agree with Mark B then. If `k` were constant (which it's not), you could use a heap to keep track of the k-th element. That would allow O(1) for finding the k-th element and O(log N) for enqueue/dequeue. – smocking Sep 21 '12 at 14:46
  • I think the solution of finding k-th element is right as this page also suggests: http://stackoverflow.com/questions/2329171/find-kth-smallest-element-in-a-binary-search-tree-in-optimum-way/2329236#2329236 – Heejin Sep 21 '12 at 14:47
  • But I think you have to care about time complexity of tree rebuilding. If every enqueue and dequeue causes tree rebuilding, it will be pretty slow. – Heejin Sep 21 '12 at 14:49
  • @smocking, yes, you are right: when k is a constant the time complexity is as you said. Got it. – outlaw Sep 21 '12 at 14:54
  • @Heejin, yes, my method of k-th finding is the same as the one you mentioned. I think tree rebuilding is not a problem here because the amortized complexity is still O(log N). – outlaw Sep 21 '12 at 14:56
  • @outlaw Yes, you are right. I just thought there might be some way to maintain a lazy data structure, delaying tree rebuilding until a `find k-th element` operation comes in. If that is possible, you may reduce the constant factor of the O(log N). – Heejin Sep 21 '12 at 15:06
  • It is worth combining the binary search tree and the queue into a single structure: just use one additional pointer per node, pointing to the next node in FIFO order. – Evgeny Kluev Sep 21 '12 at 15:07
  • @Heejin: You can delay rebuilding the tree, but expected number of new/deleted elements between two "find" operations is only O(1). – Evgeny Kluev Sep 21 '12 at 15:14
  • @Evgeny Kluev Good idea! But when there are duplicate elements this might not be possible, am I right? – outlaw Sep 21 '12 at 15:15
  • There is no problem if there are not too many duplicates: just implement the tree like std::multiset. If there are many duplicates, the original structure with a separate queue is better; there, a counter of duplicates per node beats the additional pointer. – Evgeny Kluev Sep 21 '12 at 15:20
  • @Evgeny Kluev Good point, I will look into the detailed implementation of multiset. – outlaw Sep 21 '12 at 15:30
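For reference, the constant-k heap idea raised in the comments can be sketched as follows (Python; `TopK` is an illustrative name). Note it only supports adding elements: evicting the item that leaves the sliding window would need extra machinery, such as lazy deletion or an order-statistics structure as in the question.

```python
import heapq

class TopK:
    """Keep a min-heap of the k largest elements seen; heap[0] is the k-th largest."""
    def __init__(self, k):
        self.k = k
        self.heap = []  # min-heap holding the current top-k elements

    def add(self, x):
        if len(self.heap) < self.k:
            heapq.heappush(self.heap, x)
        elif x > self.heap[0]:
            # x displaces the current k-th largest element
            heapq.heapreplace(self.heap, x)

    def kth_largest(self):
        # O(1) peek once k elements have been seen
        return self.heap[0] if len(self.heap) == self.k else None
```

Each `add` is O(log k) and the k-th largest query is O(1), matching the comment's observation for fixed k.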

3 Answers


Note: you cannot achieve better than O(log n) for all operations; at best you can choose which operation you care about most. (Otherwise, you could sort in O(n) by feeding the array into the data structure and querying the 1st, 2nd, 3rd, ..., n-th elements.)

  • Using a skip list instead of a balanced BST as the sorted structure reduces dequeue complexity to O(1) average case. It does not affect the complexity of any other operation.
    To remove from a skip list, all you need to do is reach the element via the pointer stored at the head of the queue, then follow its links upward and unlink it from each level. The expected number of nodes to delete is 1 + 1/2 + 1/4 + ... = 2.
  • Finding the K-th element can be achieved in O(log K) by starting from the leftmost node (rather than the root) and walking up until you reach a node with "more descendants than needed", then treating that node as the root and applying the algorithm from the question. Though better in asymptotic complexity, the constant factor roughly doubles.
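The expected-cost claim in the first bullet (1 + 1/2 + 1/4 + ... = 2) can be checked empirically: with promotion probability 1/2, the number of levels a skip-list node occupies is geometrically distributed with mean 2. A quick simulation (Python, illustrative only):

```python
import random

def node_height(rng, p=0.5):
    # Number of skip-list levels a node appears on: 1 plus a geometric
    # number of promotions, so the mean is 1 / (1 - p) = 2 for p = 1/2.
    h = 1
    while rng.random() < p:
        h += 1
    return h

rng = random.Random(42)  # fixed seed so the run is reproducible
avg = sum(node_height(rng) for _ in range(100_000)) / 100_000
print(avg)  # close to 2: unlinking a node, given a pointer to it, touches ~2 links on average
```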
amit

I found an interesting paper:

Sliding-Window Top-k Queries on Uncertain Streams published in VLDB 2008 and cited by 71.

https://www.cse.ust.hk/~yike/wtopk.pdf

VLDB is one of the top conferences in the database research area, and the citation count suggests the data structure actually works.

The paper looks fairly difficult, but if you really need to improve your data structure, I suggest reading this paper, or the papers in its reference list.

Heejin

You can also use a finger tree.

For example, a priority queue can be implemented by labeling each internal node with the minimum priority of the children beneath it, and an indexed list/array can be implemented by labeling nodes with the number of leaves beneath them. Finger trees provide amortized O(1) cons, reversing, and cdr, with O(log n) append and split, and can be adapted to indexed or ordered sequences.

Also note that being a purely functional structure makes it a good choice for concurrent use.

Dilum Ranatunga