find median in a fixed-size moving window along a long sequence of data

Question

Given a sequence of data (it may have duplicates), a fixed-sized moving window, move the window at each iteration from the start of the data sequence, such that (1) the oldest data element is removed from the window and a new data element is pushed into the window (2) find the median of the data inside the window at each moving.

The following posts are not helpful.

Effectively to find the median value of a random sequence

joining data based on a moving time window in R

My idea:

Use 2 heaps to hold median. In side the window, sort the data in the window in the first iteration, the min heap holds the larger part and the max heap holds the smaller part. If the window has odd number of data, the max heap returns the median otherwise the arithmetic mean of the top elements of the two heaps is the median.

When a new data is pushed in to the window, remove the oldest data from one of the heap and compare the new data with the top of max and min heap so that to decide which heap the data to be put. Then, find the median just like in the first iteration.

But, how to find a data element in a heap is a problem. Heap is a binary tree not a binary search tree.

Is it possible to solve it with O(n) or O(n * lg m) where m is the window size and space: O(1) ?

Any help is really appreciated.

Thanks

Are http://stackoverflow.com/questions/5527437/rolling-median-in-c-turlach-implementation or http://stackoverflow.com/questions/1309263/rolling-median-algorithm-in-c useful? — Martin Beckett, Mar 23 '12 at 15:00
Is the newest data the next item, or is there some other criteria? Are you processing these items in a first in, first out type of way? — Glenn, Mar 23 '12 at 15:04
@Glenn, each iteration, the oldest data is deleted from the window and a new data is put into the window and then find the new median in the window. For the oldest data , it is FIFO. thanks — user1002288, Mar 23 '12 at 15:22
I don't think, space O(1) is possible. You need to store the window contents, so you won't get below O(m). — hc_, Mar 23 '12 at 15:25

score 13 · Accepted Answer · answered Mar 23 '12 at 15:34

13

O(n*lg m) is easy:

Just maintain your window as two std::sets, one for the lower half, one for the upper half. Insertion of a new element costs O(lg m), finding and removal of an old element costs the same. Determining the median using the method you described in your question costs O(1).

As you slide the window over your sequence, in each iteration you remove the item falling out of the window (O(lg m)), insert the new item (O(lg m)) and compute the median (O(1)), resulting in a total of O(n lg m).

This solution uses space O(m), of course but I don't think you can get away without storing the window's contents.

answered Mar 23 '12 at 15:34

hc_

2,628
1
18
19

1

stl:set is a binary search tree. heap is a binary tree. You can find median on top of a heap with O(1), but you cannot do it within O(1) to find the median in a binary search tree because the top element in top of a binary search tree may not be the max or min of the tree (unlike a heap). thanks ! – user1002288 Mar 23 '12 at 15:53
1

@user1002288: He's saying use *two* BST's - one for the lower `m/2 - 1` elements, and one for the upper `m/2 - 1` elements, and also store the median. When you move the window over: **1.** Remove the old element from either the lower- or upper-tree (`O(log m)`), then **2.** Insert the new element into the lower- or upper-tree, based on if it's < or > the median. If one of trees now has too many elements, remove the largest *(for the lower)* or smallest *(for the upper)* element (`O(log m)`), call that new median, and insert the old median into the other tree (`O(log m)`). Total `O(log m)` – BlueRaja - Danny Pflughoeft Mar 23 '12 at 16:06
1

@user1002288 Also, finding the min/max of a binary tree *can* be `O(1)` - just keep a reference to them. C++'s `stl::set` already does this, via `*stl::set::rbegin()` – BlueRaja - Danny Pflughoeft Mar 23 '12 at 16:10
@BlueRaja, it is a good idea. But, there may be duplicates. e.g. 3 , 6 , 6 , 7 , 9, if one of 6s is the oldest one and it is removed, but another 6 is still left. In stl:set, no duplicates are allowed. If we use multiset, the all data associated with the same key will all be deleted. It means that both 6s will be deleted. It is wrong. thanks . – user1002288 Mar 23 '12 at 16:53
@user1002288: So use a different BST that allows duplicates? The idea is still correct. – BlueRaja - Danny Pflughoeft Mar 23 '12 at 16:58
@BlueRaja, if a node has more than 2 children, is it still a BST ? If no, how to verify your idea is correct ? thanks ! – user1002288 Mar 23 '12 at 17:34
1

@user1002288: No, just use a BST that allows duplicates eg. Instead of the left subtree being `<` the parent and the right being `>`, make the left `<=` and the right `>`. Each node should still have two *(or fewer)* children. – BlueRaja - Danny Pflughoeft Mar 23 '12 at 17:36
@BlueRaja, Ok, I see, the space complexity is O(window size), is it possible to reduce it to O(1) or O(lg window size) ? thanks ! – user1002288 Mar 23 '12 at 17:58
@user1002288: No, it's `O(log m)` (m = window size) each step, making the overall algorithm `O(n log m)`; see the answer and/or my first comment above for specifics. There *may* be a way to make it `O(1)` per step, but if so, I don't know how. – BlueRaja - Danny Pflughoeft Mar 23 '12 at 18:05
@BlueRaja, you use two trees to hold the window data, the total size is O(m), right ? – user1002288 Mar 23 '12 at 18:08
@user1002288: Oh sorry, I missed that you asked about space complexity, not time complexity. I doubt it can get any better than `O(m)`, but I could be wrong. – BlueRaja - Danny Pflughoeft Mar 23 '12 at 19:09
I agree that `O(m)` is the best space complexity. It's impossible to re-calculate the median without keeping track of all the values. You might be able to keep a count of how many of each value, but that still only reduces to `O(p*m)`, where p is the percent unique values: `numUnique/m`. This may be smaller for some data, but it's still O(m). – AShelly Mar 23 '12 at 19:34
This is probably the best way to do it using STL only, but it is still cumbersome, as this problem shows perfectly that tree containers are indeed missing in STL. With AVL tree, for example, you could just have a single tree (for the window's contents). – stf Mar 20 '15 at 00:37
How to do remove the oldest data from one of the heap ? – sam_k Mar 22 '15 at 16:07

score 7 · Answer 2 · edited May 23 '17 at 11:52

I have implemented almost exactly the algorithm you describe here: http://ideone.com/8VVEa, and described it here: Rolling median in C - Turlach implementation

The way to get around the "find oldest" problem is to keep the values in a circular buffer, so you always have a pointer to the oldest. What you store in the heap are buffer indexes. So the space requirement is 2M, and each update is O(lg M).

score 1 · Answer 3 · answered Jun 27 '12 at 14:45

I gave this answer for the "rolling median in C" question

I couldn't find a modern implementation of a c++ data structure with order-statistic so ended up implementing both ideas in top coders link ( Match Editorial: scroll down to FloatingMedian).

Two multisets

The first idea partitions the data into two data structures (heaps, multisets etc) with O(ln N) per insert/delete does not allow the quantile to be changed dynamically without a large cost. I.e. we can have a rolling median, or a rolling 75% but not both at the same time.

Segment tree

The second idea uses a segment tree which is O(ln N) for insert/deletions/queries but is more flexible. Best of all the "N" is the size of your data range. So if your rolling median has a window of a million items, but your data varies from 1..65536, then only 16 operations are required per movement of the rolling window of 1 million!! (And you only need 65536 * sizeof(counting_type) bytes, e.g. 65536*4).

GNU Order Statistic Trees

Just before giving up, I found that stdlibc++ contains order statistic trees!!!

These have two critical operations:

iter = tree.find_by_order(value)
order = tree.order_of_key(value)

See libstdc++ manual policy_based_data_structures_test (search for "split and join").

I have wrapped the tree for use in a convenience header for compilers supporting c++0x/c++11 style partial typedefs:

#if !defined(GNU_ORDER_STATISTIC_SET_H)
#define GNU_ORDER_STATISTIC_SET_H
#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>

// A red-black tree table storing ints and their order
// statistics. Note that since the tree uses
// tree_order_statistics_node_update as its update policy, then it
// includes its methods by_order and order_of_key.
template <typename T>
using t_order_statistic_set = __gnu_pbds::tree<
                                  T,
                                  __gnu_pbds::null_type,
                                  std::less<T>,
                                  __gnu_pbds::rb_tree_tag,
                                  // This policy updates nodes'  metadata for order statistics.
                                  __gnu_pbds::tree_order_statistics_node_update>;

#endif //GNU_ORDER_STATISTIC_SET_H

score 1 · Answer 4 · answered Mar 23 '12 at 16:57

1

Same answer as hc_ but instead of using a stock BST use a version where every node has the count of elements in that sub-tree. This way finding median is O(log(m)).

answered Mar 23 '12 at 16:57

ElKamina

7,747
28
43

score 0 · Answer 5 · answered Jun 27 '12 at 15:10

I attach my segment tree (see my other post) which allows the frequency distribution of counts to be queried very efficiently.

This implements the following data structure:

|-------------------------------|
|---------------|---------------|
|-------|-------|-------|-------|
|---|---|---|---|---|---|---|---|
  0   1   2   3   4   5   6   7

Each segment keeps the number of counts items in the range it covers. I use 2N segments for a range of value from 1..N. These are placed in a single rolled out vector rather than the tree format show figuratively above.

So if you are calculating rolling medians over a set of integers which vary from 1..65536, then you only need 128kb to store them, and can insert/delete/query using O(ln N) where N = the size of the range, i.e. 2**16 operations.

This is a big win if the data range is much smaller than your rolling window.

#if !defined(SEGMENT_TREE_H)
#define SEGMENT_TREE_H
#include <cassert>
#include <array>
#include <algorithm>
#include <set>

#ifndef NDEBUG
#include <set>
#endif

template<typename COUNTS, unsigned BITS>
class t_segment_tree
{
    static const unsigned                       cnt_elements    = (1 << BITS);
    static const unsigned                       cnt_storage     = cnt_elements << 1;
    std::array<COUNTS, cnt_elements * 2 - 1>    counts;
    unsigned                                    count;

#ifndef NDEBUG
    std::multiset<unsigned>                     elements;
#endif
    public:

    //____________________________________________________________________________________

    //  constructor

    //____________________________________________________________________________________
    t_segment_tree(): count(0)
    {
        std::fill_n(counts.begin(), counts.size(),  0);
    }
    //~t_segment_tree();

    //____________________________________________________________________________________

    //  size

    //____________________________________________________________________________________
    unsigned size() const  { return count; }

    //____________________________________________________________________________________

    //  constructor

    //____________________________________________________________________________________
    void insert(unsigned x)
    {
#ifndef NDEBUG
        elements.insert(x);
        assert("...............This element is too large for the number of BITs!!..............." && cnt_elements > x);
#endif
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            ++counts[ii - 1];
            ii >>= 1;
        }
        ++count;
    }

    //____________________________________________________________________________________

    //  erase 

    //      assumes erase is in the set
    //____________________________________________________________________________________
    void erase(unsigned x)
    {
#ifndef NDEBUG
        // if the assertion failed here, it means that x was never "insert"-ed in the first place
        assert("...............This element was not 'insert'-ed before it is being 'erase'-ed!!..............." && elements.count(x));
        elements.erase(elements.find(x));
#endif
        unsigned ii = x + cnt_elements;
        while (ii)
        {
            --counts[ii - 1];
            ii >>= 1;
        }
        --count;
    }

    // 
    //____________________________________________________________________________________

    //  kth element

    //____________________________________________________________________________________
    unsigned operator[](unsigned k)
    {
        assert("...............The kth element: k needs to be smaller than the number of elements!!..............." && k < size());
        unsigned ii = 1;
        while (ii < cnt_storage)
        {
            if (counts[ii - 1] <= k)
               k -= counts[ii++ - 1];
            ii <<= 1;
        }
        return (ii >> 1) - cnt_elements;
    }

};
#endif

find median in a fixed-size moving window along a long sequence of data

5 Answers5

Two multisets

Segment tree

GNU Order Statistic Trees

Linked