1

I am pondering the solution to the following problem: I have a large array of integers (in even tougher case a stream of integers) and I want to convert that array into and array of medians, tht is it position corresponds to median of array [0..i].

Now, my brute force approach would be for each subarray, sort it and then select middle element. That is O(n^2 log n), as each n subarrays must be sorted in N log N time. I could probably lower the time to N^2 using some counting mechanism, but is N^2 the best that can be done?

Bober02
  • 15,034
  • 31
  • 92
  • 178
  • The answers (specifically the references included in them) to this (closed) question might help you http://stackoverflow.com/questions/3440905/find-the-median-from-a-stream-of-integers – High Performance Mark Jan 11 '13 at 11:38
  • Using selection algorithm instead of sorting one can trivially improve the complexity to `O(n^2)`, though my gut tells me there is a more efficient way, probably `O(nlogn)` one. – amit Jan 11 '13 at 11:51

3 Answers3

3

Inserting a single item into an already sorted tree-like structure, like e.g. a red-black tree, will take O(log n). If you maintain the number of descendants for every node, then finding the median can be done in O(log n) as well. Do so for all n elements of your stream, and you have an O(n log n) algorithm.

The data structure and median computation could look something like this:

struct TreeNode {
  enum { RED, BLACK } color;
  size_t numElements;
  int value;
  TreeNode* left, right;
} *root;

TreeNode* treeSelect(TreeNode *top, size_t count) {
  if (top->left != NULL) {
    if (count < top->left->numElements)
      return treeSelect(top->left, count)
    count -= top->left->numElements;
  }
  if (count == 0)
    return top;
  return treeSelect(top->right, count - 1);
}

TreeNode* treeMedian() {
  return treeSelect(root, root->numElements / 2);
}

The other operations are those usually used for red-black trees, although you can skip those for removal of nodes. You can adjust them to work with duplicate elements. The general rule here is that when an element to be inserted is equal to the element of the current node, you may insert into any child tree. And when balancing the tree, the order of duplicate keys should be maintained, so that you maintain the order of the attached subtrees as well. But unless I miss something, balancing will work without value comparisons in any case, so once you've inserted duplicates, you are done. If you expect really many duplicate values, you might use a multimap-like approach instead, with a counter in each node.

MvG
  • 57,380
  • 22
  • 148
  • 276
  • SO the tree have to be balanced all the time to make this work? Also, what if there are duplicate elements? – Bober02 Jan 11 '13 at 14:26
  • and how do you find the median given a tree? – Bober02 Jan 11 '13 at 14:30
  • @Bober02, the tree has to be balanced within the limits of a Red-Black tree in order to guarantee the limit on the worst case time complexity. I updated my answer to discuss duplicate elements and sketch the median-finding process. – MvG Jan 11 '13 at 16:50
0

It can be done in O(nlogn) with the following approach:

Use a skip list (or a B+ tree) to maintain the stream of data.
Also maintain a pointer to the current median.

In each iteration, let the number of nodes be n (before inserting the just arrived element), the median (value) be m and the new value you see be x.

  1. Insert x into the skip list
  2. If n%2 ==0 (the index of the median should increase):
    if x < m.value: no changes need, the index of m is also increased
    else: set m <- m.next
  3. else: (the index of the median should remain the same)
    if x < m.value: set m <- m.previous //because the index of m was increased
    else: no changes needed. //the index of m did not increase
  4. m.value is the median for the next element

finding previous/next is O(1), inserting x is O(logn) - total complexity is O(nlogn) with O(n) additional space.

Note: Special care should be added if the stream can contain duplicate elements, and the skiplist should have a deterministic behavior for it - i.e. insert always an the last possible index

amit
  • 175,853
  • 27
  • 231
  • 333
0

Use some fast BST like AVL or Splay. You can simply modify structure so each node will can get median for its subtree in log N time. Especially you can get median for all items in tree.

Now you are doing something like this:

Create empty BST
foreach n in stream:
    Insert n to BST
    Get median from BST.root
Ari
  • 3,101
  • 2
  • 27
  • 49
  • You usually don't want the *median* from a subtree, but instead the element at a given index which will be the median for the whole tree. – MvG Jan 11 '13 at 12:56
  • RIght, so the main requirement is the fact that the tree has to be balanced all the time? – Bober02 Jan 11 '13 at 14:24
  • Also, what about duplicates in the list? – Bober02 Jan 11 '13 at 14:28
  • @Bober02 What about them? You can have duplicates in tries like AVL or Splay. Or if you don't want to have duplicates in tree simply check if given element exists in tree. – Ari Jan 15 '13 at 14:19