26

Those of you that have read my previous questions know about my work at understanding and implementing quicksort and quickselect, as well as some other basic algorithms.

Quickselect is used to calculate the kth smallest element in an unsorted list, and this concept can also be used to find the median in an unsorted list.

This time, I need aid in devising an efficient technique to calculate the running median, because quickselect isn't a good choice as it needs to re-calculate every time the list changes. Because quickselect has to restart everytime, it can't take advantage of previous calculations done, so I'm looking for a different algorithm that's similar (possibly) but is more efficient in the area of running medians.

Edge
  • 2,456
  • 6
  • 32
  • 57
  • This can be done in linear time by using partition from quick sort algo but has worst time n^2. Pick random point in your collection as pivot and move the other elems so that elems smaller than pivot are on the left and larger or equal are on the right. If pivot is in the middle it is the median, if not go to the chunk that has the median (the larger size chunk). Repeat. Another algo that guarantees linear time it median of medians described in CLRS and I believe also on wikipedia. Look those up. – Adrian Nov 19 '14 at 18:27

6 Answers6

51

The streaming median is computed using two heaps. All the numbers less than or equal to the current median are in the left heap, which is arranged so that the maximum number is at the root of the heap. All the numbers greater than or equal to the current median are in the right heap, which is arranged so that the minimum number is at the root of the heap. Note that numbers equal to the current median can be in either heap. The count of numbers in the two heaps never differs by more than 1.

When the process begins the two heaps are initially empty. The first number in the input sequence is added to one of the heaps, it doesn’t matter which, and returned as the first streaming median. The second number in the input sequence is then added to the other heap, if the root of the right heap is less than the root of the left heap the two heaps are swapped, and the average of the two numbers is returned as the second streaming median.

Then the main algorithm begins. Each subsequent number in the input sequence is compared to the current median, and added to the left heap if it is less than the current median or to the right heap if it is greater than the current median; if the input number is equal to the current median, it is added to whichever heap has the smaller count, or to either heap arbitrarily if they have the same count. If that causes the counts of the two heaps to differ by more than 1, the root of the larger heap is removed and inserted in the smaller heap. Then the current median is computed as the root of the larger heap, if they differ in count, or the average of the roots of the two heaps, if they are the same size.

Code in Scheme and Python is available at my blog.

user448810
  • 17,381
  • 4
  • 34
  • 59
  • 1
    Is there any code for implementation with C++? Thanks for the reply by the way, appreciated. I like the verbose explanation. – Edge Jun 07 '12 at 11:42
  • Also, does your idea only work on sorted lists, or unsorted lists as well? – Edge Jun 07 '12 at 11:43
  • @Andrew, have you looked at boost accumulators? – Chisholm Jun 07 '12 at 12:10
  • I know nothing of boost functions. – Edge Jun 07 '12 at 12:22
  • Andrew: The input need not be sorted. The heaps do the sorting internally. – user448810 Jun 07 '12 at 12:36
  • I'm having a lot of trouble implementing this in C++. I don't really know where to start, the python code on your blog seems very different. – Edge Jun 10 '12 at 02:15
  • There's lots of function references to pop and push, but no implementation of them. – Edge Jun 10 '12 at 02:16
  • For the Scheme solution, the top of the second page of the blog entry has three links to different implementations of priority queues, using three different algorithms. All of the Python comments import priority queues from Python's standard library. And C++ provides a priority queue in its STL, which you may decide to use instead of writing your own priority queue. – user448810 Jun 10 '12 at 15:53
  • 4
    Adding elements to the system is OK. But how removing old elements would work? – Notinlist Nov 13 '15 at 15:57
  • Implementation in Go https://github.com/JaderDias/movingmedian – Jader Dias Oct 04 '17 at 11:10
19

The Jeff McClintock running median estimate. Requires keeping only two values. This example iterates over an array of sampled values (CPU consumption). Seems to converge relatively quickly (about 100 samples) to an estimate of the median. The idea is at each iteration the median inches toward the input signal at a constant rate. The rate depends on what magnitude you estimate the median to be. I use the average as an estimate of the magnitude of the median, to determines the size of each increment of the median. If you need your median accurate to about 1%, use a step-size of 0.01 * the average.

float median = 0.0f;
float average = 0.0f;

// for each sample
{
    average += ( sample - average ) * 0.1f; // rough running average.
    median += _copysign( average * 0.01, sample - median );
}

enter image description here

Jeff McClintock
  • 1,242
  • 10
  • 27
  • 2
    While this solution is very efficient, beware of the following caveats: 1) convergence rate depends on signal amplitude (compare step responses with different offsets and amplitudes), therefore does not converge against signal near zero! 2) for near constant input signal this estimate introduces jitter with amplitude of `average * 0.01` and frequency of sample rate 3) deflects on short impulses (which a median originally does not, thus being popular as pepper and noise filter) – orzechow Jun 06 '17 at 12:02
  • Using the rolling standard deviation instead of the rolling mean to scale the step-increment may be an interesting direction for log-normally distributed data (which accounts for many/most naturally arising processes). Using a measure of variability instead of mean will address the issue of jitter raised by orzechow. Unfortunately, variance is not robust, so jitter may be re-introduced by large transient outliers. It might then be interesting to stack filters, which I would expect to have the effect of increasing the effective window width. – sircolinton Jul 05 '19 at 16:46
6

One solution would be to maintain an order statistic tree, inserting each element of the sequence in turn, then compute the median of the elements in the tree.

This would take O(lg n) time per insertion and O(lg n) time per median, for a total of O(n lg n) time, plus O(n) space.

Fred Foo
  • 355,277
  • 75
  • 744
  • 836
  • Is that tree type good for this purpose? I haven't heard of it before. – Edge Jun 07 '12 at 11:28
  • 2
    Order statistic trees allow indexing, i.e. getting the k'th smallest element of a sequence, in logarithmic time. – Fred Foo Jun 07 '12 at 11:29
  • Will this work with an unsorted list? – Edge Jun 07 '12 at 11:31
  • Yes, you just insert the items into the tree and they will get sorted. An order statistic tree is an augmented BST. – Fred Foo Jun 07 '12 at 11:33
  • I notice the page you linked contains pseudo code, not actual code. I understand basic BST's, but not much more, and implementing that tree would take a while following pseudo code. – Edge Jun 07 '12 at 11:34
  • Actually, nevermind. Looks pretty simple to me. Are there any other references that discuss more indepth how the order stat tree works? – Edge Jun 07 '12 at 11:40
  • @Andrew: https://en.wikipedia.org/wiki/Introduction_to_Algorithms – Fred Foo Jun 07 '12 at 11:47
1

Here is a C++ balanced tree structure that provides the ability to query by index in the sorted list. Since it maintains all values in sorted order, this is not be quite as efficient as the two-heaps approach, but it offers some additional flexibility. For example, it could also give you a running quartile.

template <typename T>
class Node
{
public:
    T key;
    Node* left;
    Node* right;
    size_t size;

    Node(T k) : key(k)
    {
        isolate();
    }

    ~Node()
    {
        delete(left);
        delete(right);
    }

    void isolate()
    {
        left = NULL;
        right = NULL;
        size = 1;
    }

    void recount()
    {
        size = 1 + (left ? left->size : 0) + (right ? right->size : 0);
    }

    Node<T>* rotateLeft()
    {
        Node<T>* c = right;
        Node<T>* gc = right->left;
        right = gc;
        c->left = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* rotateRight()
    {
        Node<T>* c = left;
        Node<T>* gc = left->right;
        left = gc;
        c->right = this;
        recount();
        c->recount();
        return c;
    }

    Node<T>* balance()
    {
        size_t lcount = left ? left->size : 0;
        size_t rcount = right ? right->size : 0;
        if((lcount + 1) * 2 < (rcount + 1))
        {
            size_t lcount2 = right->left ? right->left->size : 0;
            size_t rcount2 = right->right ? right->right->size : 0;
            if(lcount2 > rcount2)
                right = right->rotateRight();
            return rotateLeft();
        }
        else if((rcount + 1) * 2 <= (lcount + 1))
        {
            size_t lcount2 = left->left ? left->left->size : 0;
            size_t rcount2 = left->right ? left->right->size : 0;
            if(lcount2 < rcount2)
                left = left->rotateLeft();
            return rotateRight();
        }
        else
        {
            recount();
            return this;
        }
    }

    Node<T>* insert(Node<T>* newNode)
    {
        if(newNode->key < key)
        {
            if(left)
                left = left->insert(newNode);
            else
                left = newNode;
        }
        else
        {
            if(right)
                right = right->insert(newNode);
            else
                right = newNode;
        }
        return balance();
    }

    Node<T>* get(size_t index)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            return left->get(index);
        else if(index > lcount)
            return right ? right->get(index - lcount - 1) : NULL;
        else
            return this;
    }

    Node<T>* find(T k, size_t start, size_t* outIndex)
    {
        if(k < key)
            return left ? left->find(k, start, outIndex) : NULL;
        else if(k > key)
            return right ? right->find(k, left ? start + left->size + 1 : start + 1, outIndex) : NULL;
        else
        {
            if(outIndex)
                *outIndex = start + (left ? left->size : 0);
            return this;
        }
    }

    Node<T>* remove_by_index(size_t index, Node<T>** outNode)
    {
        size_t lcount = left ? left->size : 0;
        if(index < lcount)
            left = left->remove_by_index(index, outNode);
        else if(index > lcount)
            right = right->remove_by_index(index - lcount - 1, outNode);
        else
        {
            *outNode = this;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }

    Node<T>* remove_by_value(T k, Node<T>** outNode)
    {
        if(k < key)
        {
            if(!left)
                throw "not found";
            left = left->remove_by_value(k, outNode);
        }
        else if(k > key)
        {
            if(!right)
                throw "not found";
            right = right->remove_by_value(k, outNode);
        }
        else
        {
            *outNode = this;
            size_t lcount = left ? left->size : 0;
            size_t rcount = right ? right->size : 0;
            if(lcount < rcount)
                return left ? right->insert(left) : right;
            else
                return right ? left->insert(right) : left;
        }
        return balance();
    }
};

template <typename T>
class MyReasonablyEfficientRunningSortedIndexedCollection
{
private:
    Node<T>* root;
    Node<T>* spare;

public:
    MyReasonablyEfficientRunningSortedIndexedCollection() : root(NULL), spare(NULL)
    {
    }

    ~MyReasonablyEfficientRunningSortedIndexedCollection()
    {
        delete(root);
        delete(spare);
    }

    void insert(T key)
    {
        if(spare)
            spare->key = key;
        else
            spare = new Node<T>(key);
        if(root)
            root = root->insert(spare);
        else
            root = spare;
        spare = NULL;
    }

    void drop_by_index(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        delete(spare);
        root = root->remove_by_index(index, &spare);
        spare->isolate();
    }

    void drop_by_value(T key)
    {
        if(!root)
            throw "out of range";
        delete(spare);
        root = root->remove_by_value(key, &spare);
        spare->isolate();
    }

    T get(size_t index)
    {
        if(!root || index >= root->size)
            throw "out of range";
        return root->get(index)->key;
    }

    size_t find(T key)
    {
        size_t outIndex;
        Node<T>* node = root ? root->find(key, 0, &outIndex) : NULL;
        if(node)
            return outIndex;
        else
            throw "not found";
    }

    size_t size()
    {
        return root ? root->size : 0;
    }
};
Mike Gashler
  • 609
  • 7
  • 12
1

Algorithm of rolling window median:

median is a sorted array where you take from it the middle value.

simple rolling implementation is with a queue(dqueue) and a sorted_array (any implementation, binary tree, skiparray).

d_queue is an array where you can push to tail and shift (pop) from the front of the array.

sorted_array is an array where you insert by order at position found using binary search.

I used a queue (first-in-first-out array) to track order of added values to know which items to remove from the median array, when they the queue is longer than the wanted size. to fall off elements by date time or some running index, it is possible to add another queue and check the first element is too old, and decide if to remove first value from both queues.

To calculate a median efficiently I use a sorted array technique. it is when you insert new items into its sorted place, so the array is always sorted.

  1. The insert:

    • Insert at ordered place in the sorted_array,
    • and push a value into a queue.
  2. The remove:

    • If d_queue first element is off the window, or if in an another queue you can have with indexes, the index is too old, then:
      • remove the first item from d_queue(s),
      • and binary search for it in the sorted array and remove it.
  3. To have the median:

    • Use the value(s) in the middle of the sorted_array.
    • If sorted_array length is even use the item in the middle.
    • If sorted_array length is odd use an average of two items in the middle.
Shimon Doodkin
  • 4,310
  • 34
  • 37
0
#include<cstdio>
#include<iostream>
#include<queue>
#include <vector>         
#include <functional>

typedef priority_queue<unsigned int> type_H_low;
typedef priority_queue<unsigned int, std::vector<unsigned int>, std::greater<unsigned int> > type_H_high;

size_t signum(int left, int right) {
    if (left == right){
        return 0;
    }
    return (left < right)?-1:1;
}

void get_median( unsigned int x_i, unsigned int &m, type_H_low *l, type_H_high *r) {

    switch (signum( l->size(), r->size() )) {
    case 1: // There are more elements in left (max) heap
        if (x_i < m) {
            r->push(l->top());
            l->pop();
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case 0: // The left and right heaps contain same number of elements
        if (x_i < m){
            l->push(x_i);
        } else {
            r->push(x_i);
        }
        break;

    case -1: // There are more elements in right (min) heap
        if (x_i < m){
            l->push(x_i);
        } else {
            l->push(r->top());
            r->pop();
            r->push(x_i);
        }
        break;
    }

    if (l->size() == r->size()){
        m = l->top();
    } else if (l->size() > r->size()){
        m = l->top();
    } else {
        m = r->top();
    }

    return;
}

void print_median(vector<unsigned int> v) {
    unsigned int median = 0;
    long int sum = 0;
    type_H_low  H_low;
    type_H_high H_high;

    for (vector<unsigned int>::iterator x_i = v.begin(); x_i != v.end(); x_i++) {
        get_median(*x_i, median, &H_low, &H_high);
        std::cout << median << std::endl;
    }
}
  • 2
    Hello, welcome to SO. Your answer contains only code. It would be better if you could also add some commentary to explain what it does and how. Can you please [edit] your answer and add it? Thank you! – Fabio says Reinstate Monica Jun 28 '17 at 23:19