How to keep track of the median of changing vector of integers?

Question

Trying to solve a problem at http://www.hackerearth.com/problem/algorithm/sum-of-medians-1/ and thought to solve it using the multiset as it may contain duplicate values. I tried to code as follows:

#include<iostream>
#include<set>
#include<algorithm>
using namespace std;
int main()
{
  int n,k,med_sum=0,p;
  cin>>n;
  multiset<int> m;
  multiset<int>::iterator itr;
  for(int i=0;i<n;i++)
  {
    cin>>k;
    m.insert(k);
    p=k;
    if(p<=k)
      itr--;
    else
      itr++;
    med_sum+=*itr;
  }
  cout<<med_sum<<endl;
  return 0;
}

Sample Input:
n=5
10
5
1
2
15
Sample Output: 27
Explanation:
m(1)=median of [10]=10
m(2)=median of [10 5]=5
m(3)=median of [10 5 1]=5
m(4)=median of [10 5 1 2 15]=5
med_sum=10+5+5+2+5=27

@nosid http://www.hackerearth.com/problem/algorithm/sum-of-medians-1/ — Satish Patel, Oct 04 '13 at 18:04
@CarlNorum I am fairly new to STL and as I know that multiset stores the elements in sorted order allowing duplicates also. Correct me as well if anything going wrong. — Satish Patel, Oct 04 '13 at 18:14
You never initialize `itr`, so its hard to see how that code can work. It crashes when I run it... — Chris Dodd, Oct 04 '13 at 18:29
@CarlNorum: In C++, `std::multiset` is ordered and there's also `std::unordered_multiset`. A bit strange, but the ordered one came first. — aschepler, Oct 04 '13 at 18:32
Use two heaps (priority queues): http://stackoverflow.com/a/10657732/56778 — Jim Mischel, Oct 04 '13 at 21:41
possible duplicate of [find median in O(log n)](http://stackoverflow.com/questions/7842347/find-median-in-olog-n) — Jerry Coffin, Oct 05 '13 at 00:13

Yakk - Adam Nevraumont · Accepted Answer · 2013-10-04T18:30:25.443

An easy way to do this is to maintain two sorted containers, one lower than the median, one greater.

When you find a new element, see what container to insert it into (so that all elements of lower are always less than or equal to all elements of upper), then rebalance counts so that the median is "in the gap" between them.

Yours sort of does this, but defines the lower range to be [.begin(), it) and the upper to be [it, .end()). I'd make them separate containers to reduce the amount of state you need to keep in your head to understand the code, unless performance was particularly important.

Maintain two sorted containers, low and high, with the following invariants:

low.size() is the same as high.size() or 1 larger
Every element of low is less than or equal to every element of high. Then the median of low union high is low.last().

Assuming you have such a pair of containers, if you wanted to add an element x, I would first maintain invariant 2 -- if low.empty() or x <= low.last(), stick it in low, otherwise stick it in high.

Next, maintain invariant 1: if high has more elements than low, remove the lowest element from high and add it to low. If low has 2 more elements than high, remove the highest element from low and add it to high.

If we started with a low and high that obeyed the above two invariants, then after the above steps we still have a low and high that obeys these two invariants. And when the above two invariants hold, the median is low.last()!

@SatishPatel done. Short of writing out the answer for you, and maybe a corner case, the above fully describes what you need to do. — Yakk - Adam Nevraumont, Oct 04 '13 at 18:31
This "streaming median" algorithm is correct. I implement it at [my blog](http://programmingpraxis.com/2012/05/29/streaming-median/) using priority queues as the containers. — user448810, Oct 07 '13 at 13:18

Chris Dodd · Answer 2 · 2013-10-04T19:55:20.010

What you appear to be trying to do (incorrectly, with your posted code) is keep the values in a multiset with an iterator pointing at the median value, and adjust that pointer each time you add to the set based on whether the value you are adding is above or below the median.

That's a good design and probably the fastest way of solving this problem. However, it won't work with std::multiset because of a limitation of std::multiset that causes problems with one crucial corner case -- when the new value being added to the set is equal to the old median, you have no way of knowing whether it was inserted before or after the old median in the multiset. So you can't know how to adjust the median pointer in that case.

So if you want to solve the problem this way, you'll need to implement your own rbtree or other set abstraction in a way that can give you this crucial piece of information. Alternately, when inserting a value equal to the old median, you can refind the median from scratch (an expensive operation, but should be rare).

Edit

hints for the OP to make the code work with C++11 multisets:

you need to initialize itr when you insert the first value into the multiset
you only need to adjust itr after inserting on the "wrong" side of itr. You can tell which side is "wrong" based on whether the multiset size is odd or even.
since a value equal to the old median will always be inserted after itr you want the test
if (k < *itr) not <=

multiset preserves the order of insertion of equivalent elements (as of C++11) — Cubbi, Oct 04 '13 at 19:26
@Cubbi: If so, then it could be used in this way, and the OP need only fix his code to initialize/modify `itr` correctly. — Chris Dodd, Oct 04 '13 at 19:43

SirGuy · Answer 3 · 2013-10-05T04:37:11.033

How about just using a std::vector? Keep the values sorted by inserting where std::lower_bound tells you to.

    std::vector<int> p;
    int k, med_sum;
    while(cin >> k)
    {
      p.insert(std::lower_bound(p.begin(), p.end(), k), k);
      med_sum += p[p.size()/2];
    }
    std::cout << med_sum << std::endl;

EDIT
The advantage of using the vector is that the algorithm is very clear and concise (only 3 lines of logic) and the cache locality of the vector's memory and the bulk memory allocations make it at least competitive vs. a map/set or list implementation even if its insertion is O(n).

For comparison I wrote up an (untested) implementation using the multiset:

    std::multiset<int> p;
    int k, med_sum;
    cin >> med_sum;
    p.insert(med_sum);
    std::multiset<int>::iterator itr = p.begin();
    while(cin >> k)
    {
      if(p.size() % 2 == 0)
      {
         if(k < *itr) ++itr;
      }
      else
      {
         if(k < *itr) --itr;
         else ++itr;
      }
      med_sum += *itr;
    }
    cout << med_sum << endl;

The 7 lines of logic is not bad, but it's definitely less clear what is happening and I can't tell just by looking at it if it's right. Note this algorithm only works in C++11 where multimap::insert inserts duplicate values at the top end of the range (of duplicates).

All in all I think the vector implementation is the better way to go, unless necessity and measurements say to go with the multiset

@JimMischel Agreed, the advantage here is that the algorithm takes 3 lines. — SirGuy, Oct 05 '13 at 04:18
@JimMischel expanded my answer in defense of the `vector` implementation — SirGuy, Oct 05 '13 at 04:37
@guygreer It works well but I would like to know whether using muktiset is faster or vector? I think multiset would promise to have a great performance. — Satish Patel, Oct 05 '13 at 09:40
@SatishPatel Then implement both methods and run them in larger and larger loops and time how fast each one runs. — SirGuy, Oct 05 '13 at 16:18

How to keep track of the median of changing vector of integers?

3 Answers3