1

Given a sequence of data (it may have duplicates), a fixed-sized moving window, move the window at each iteration from the start of the data sequence, such that (1) the oldest data element is removed from the window and a new data element is pushed into the window (2) find the median of the data inside the window at each moving.

Here window size is 7, lets call it m. m = window size n = number of elements in sequence, it may be 1000 or 10000

Median for 4, 6, 99, 10, 90, 12, 17,1,21,32  : 12
Median for 6, 99, 10, 90, 12, 17, 1,21,32  : 12
Median for 99, 10, 90, 12, 17, 1, 21,32 : 17
Median for 10, 90, 12, 17, 1, 21, 32 : 17

I implemented this thing with help of Quicksort of m elements each time, (median of three as pivot). But this takes a lot of time. Every time sorting is require. I supposed to implement min and max heap solution as mentioned here

Problem in Min-Max heap solution :

When a new data is pushed in to the window, remove the oldest data from one of the heap and compare the new data with the top of max and min heap so that to decide which heap the data to be put. Then, find the median just like in the first iteration.

  1. How to find remove the oldest data from heap, how can we maintain this. As per given example at second time 4 is oldest element, third time 6 is oldest element. How can we remove it from heap.

  2. Followup question of above question, how to find a data element in a heap is a problem. Heap is a binary tree not a binary search tree.

Any help would be appreciated.

EDIT : I already have input data, So no insertion happening.Only happening to fixed size queue or window not for actual input sequence.

Thanks

Community
  • 1
  • 1
sam_k
  • 5,983
  • 14
  • 76
  • 110
  • The normal way would be to keep reverse pointers mapping heap indices to the array positions and array positions to heap indices, and which the heap swap and other operations then have sufficient information to maintain. – doynax Mar 23 '15 at 15:00
  • @doynax can you please explain little bit more. I didnt understand what you have said – sam_k Mar 23 '15 at 16:48
  • @sam_k Regarding your edit, an insertion and deletion would be happening each time the window moves. – sray Mar 23 '15 at 16:52
  • @sam_k: I'm proposing that you cross-reference the data-structure in both directions so that your operations can efficiently reach both ways. For every element in the current window you maintain an index into the heap where it is stored, and conversely for every heap entry you keep an index to the window where the data is to be found. Whenever an item is moved on the heap you look at the heap-to-window index and use it to update the converse window-to-heap index reaching back to new heap position. – doynax Mar 23 '15 at 17:16
  • Thanks @doynax for your comment. I will try to understand what you have said. I am new in data structure, will try to understand this. If possible to explain your answer with example as a Answer on this post. – sam_k Mar 23 '15 at 17:54

2 Answers2

0

For a running median of data (moving window in which both insertions and deletions are happening), it might be more useful to use a single binary search tree instead of the min-max heaps for the elements in the moving window.

The binary search tree would also be an order statistic tree storing the size of the subtree rooted at each node.

This tree would be able to do insertions, deletions and median computations in O(log n).

In the above example given, each operation will have time complexity O(log 7).

sray
  • 584
  • 3
  • 8
  • BST is good option that Quick sort for each time (7log7) for 500 elements? – sam_k Mar 23 '15 at 17:00
  • For windows size 10, quick sort working fine.But for 100 window size and elements are 500. Then it becomes worst – sam_k Mar 23 '15 at 17:01
  • Quicksort is O(n logn), whereas BST will answer in only O(log n). Even for window of size 100, its extremely cheap. – sray Mar 23 '15 at 17:05
  • @sray: You don't actually need a full quicksort though, you can just ignore anything not straddling the median and perform the operation in linear time. Still, the logarithmic heap and BST implementations are more efficient. Personally I would go with the heap approach just because I'm terrified of heaving to implement a balanced binary search tree ;) – doynax Mar 23 '15 at 17:22
  • @doynax Quicksort is certainly not needed, and would be the least efficient solution! I think heap solution is good for online median problem. But here you would need a queue in addition to the 2 heaps. And the heaps would need to be readjusted (O(logn) operations) every time any heap has 2 elements more than the other one. – sray Mar 23 '15 at 17:29
  • @sray: I think the heap storage overhead should win out actually (heap-to-window index and a window-to-heap-index) compared to the BST (two-pointers child pointers and a window index/data). The performance is trickier to judge, either solution is logarithmic but I'm not too clear on how the constant factors would turn out in practice. – doynax Mar 23 '15 at 17:46
0

In addition to your heap DS, also hold a queue of pointers (or references), where each element in the queue points to the node representing the relevant entry in the heap.

When you move the window by 1:

  1. dequeue old element
  2. follow the pointer, to find the node in the binary heap.
  3. "switch" the "last" node (rightest element in the lowest level) with the node you just found
  4. Remove the new "last" node
  5. Re-heapify
  6. create new entry to the binary heap with the next element
  7. enqueue the pointer the node just added
  8. re-calculate median
amit
  • 175,853
  • 27
  • 231
  • 333
  • Don't forget that heapifying also shuffles unrelated nodes, requiring corresponding updates to the pointer queue. In effect need pointers from the heap to the queue and from the queue to the heap (perhaps in place of the value). Any reordering of the nodes within the heap may then efficiently update the pointers from both ends. – doynax Mar 23 '15 at 16:38
  • @doynax re-heapfy can be done by changing the auxillary data of each node and not touching the value, so a node will represent value `x` for all its existence. – amit Mar 23 '15 at 16:48
  • Thanks amit for your answer, these above operations are efficient for window size 200 and total elements 1000.. Because in quick sort i got total time as 5 seconds. You think it will be decrease in above min-max solution? – sam_k Mar 23 '15 at 17:05
  • @amit: You mean by letting the heap store pointers to node objects with back-indices into the heap, which would be updated while reordering the heap? – doynax Mar 23 '15 at 17:05
  • @amit why we require to store pointers of nodes. We can not store value of nodes. then find from binary heap and remove it. Or pointer option will decrease the time to find element from heap? – sam_k Mar 23 '15 at 17:40
  • @doynax The heap is a binary tree, instead of switching value between node1,node2 while swapping, you can swap `left` and `right`, `parent` attributes - and in the linked nodes via these attributes as well similarly. This is constant overhead – amit Mar 23 '15 at 17:41
  • @sam_k If you are using a heap, search by value is inefficient in a heap, it is `O(n)` operation. But if you have the pointer to the node itself, deletion is much more efficient and done in `O(logn)`. – amit Mar 23 '15 at 17:42
  • @amit: I see what you're getting at. You can still use an implicit heap array though, as long as you maintain a back-index. That way you two save "words" of overhead per node. – doynax Mar 23 '15 at 17:51
  • @amit May i know how can i search with pointer in heap, i am new in Data structure and C++, so can you please explain me in your answer more with little pointer example if possible – sam_k Mar 24 '15 at 23:54
  • @amit if i am using vector for heap storage then iterator would be its pointer? – sam_k Mar 24 '15 at 23:55