
I created a program that finds the median of a list of numbers. The list is dynamic in that numbers can be inserted and removed (duplicate numbers can be entered), and each time this happens the new median is re-evaluated and printed out.

I created this program using a multimap because

1) it is already sorted,
2) insertion, deletion, and searching are easy (multimap gives O(log N) binary-search-style lookups),
3) duplicate entries are allowed.

The constraints for the number of entries + deletions (represented as N) are: 0 < N <= 100,000.

The program I wrote works and prints out the correct median, but it isn't fast enough. I know that unordered_multimap is faster than multimap, but the problem with unordered_multimap is that I would have to sort it; to find the median you need a sorted list. So my question is: would it be practical to use an unordered_multimap and then quicksort the entries, or would that just be ridiculous? Would it be faster to just use a vector, quicksort the vector, and use a binary search? Or maybe I am forgetting some fabulous solution out there that I haven't even thought of.
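
The question doesn't include the actual code, but a typical way to read the median out of a multimap looks like the sketch below (hypothetical, with dummy mapped values). std::multimap iterators are not random access, so reaching the middle element is an O(N) walk, which is the likely bottleneck:

#include <iostream>
#include <iterator>
#include <map>

int main()
{
    std::multimap<int, int> mm = {{5,0}, {1,0}, {2,0}, {4,0}, {3,0}};
    auto mid = std::next(mm.begin(), mm.size() / 2); // linear walk to the middle
    std::cout << mid->first << std::endl;            // prints 3
}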

Though I'm not new to C++, I will admit that my skills with time complexity are somewhat mediocre.


The more I look at my own question, the more I'm beginning to think that just using a vector with quicksort and binary search would be better since the data structures basically already implement vectors.

Bo Persson
user1066524
  • I don't think a map is a good data structure for this problem. Best performance would probably come from a vector, though I haven't compared the two. – evanmcdonnal Mar 14 '13 at 19:57
  • @evanmcdonnal, I think you might be right. I'm thinking a vector might be just as fast with quicksort and binary search. – user1066524 Mar 14 '13 at 19:59
  • I think this question is a duplicate of: http://stackoverflow.com/questions/1387497/find-median-value-from-a-growing-set – Jose Luis Blanco Mar 14 '13 at 20:04
  • @JoseLuisBlanco: It's not exactly because the set may be shrinking. – ipc Mar 14 '13 at 20:08
  • Well, the vector was faster than the multimap, but it still isn't fast enough... which is weird. Maybe I'm slowing it down in some other way. – user1066524 Mar 14 '13 at 21:48

5 Answers


The more I look at my own question, the more I'm beginning to think that just using a vector with quicksort and binary search would be better since the data structures basically already implement vectors.

If you have only a few updates, use an unsorted std::vector plus the std::nth_element algorithm, which is O(N). You don't need a full sort, which is O(N log N).

Live demo of nth_element:

#include <algorithm>
#include <iterator>
#include <iostream>
#include <ostream>
#include <vector>

using namespace std;

template<typename RandomAccessIterator>
RandomAccessIterator median(RandomAccessIterator first,RandomAccessIterator last)
{
   RandomAccessIterator m = first + distance(first,last)/2; // handle even middle if needed
   nth_element(first,m,last);
   return m;
}

int main()
{
   vector<int> values = {5,1,2,4,3};
   cout << *median(begin(values),end(values)) << endl;
}

Output is:

3

If you have many updates and only remove from the middle, use two heaps as comocomocomocomo suggests. If you use a fibonacci_heap, then you also get O(N) removal from an arbitrary position (when you don't have a handle to the element).
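
A minimal sketch of that handle-based erase with boost::heap::fibonacci_heap (Boost.Heap; hypothetical values): push() hands back a handle and erase(handle) is cheap, but without the handle you would first have to find the element, which is the O(N) part.

#include <boost/heap/fibonacci_heap.hpp>
#include <iostream>

int main()
{
    boost::heap::fibonacci_heap<int> heap;  // a max-heap by default
    auto handle = heap.push(42);            // push() returns a handle to the element
    heap.push(7);
    heap.erase(handle);                     // O(log N) amortized, given the handle
    std::cout << heap.top() << std::endl;   // prints 7
}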

If you have many updates and need O(log N) removal from arbitrary places, then use two multisets as ipc suggests.

Evgeny Panasyuk

If your purpose is to keep track of the median on the fly, as elements are inserted/removed, you should use a min-heap and a max-heap. Each one would contain one half of the elements... There was a related question a couple of days ago: How to implement a Median-heap
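
A minimal sketch of that scheme with std::priority_queue (insertions only, since, as this answer notes, plain heaps won't let you remove specific values; all names here are mine):

#include <functional>
#include <iostream>
#include <queue>
#include <vector>

std::priority_queue<int> lower;                                      // max-heap: smaller half
std::priority_queue<int, std::vector<int>, std::greater<int>> upper; // min-heap: larger half

void insert(int v)
{
    if (lower.empty() || v <= lower.top()) lower.push(v);
    else                                   upper.push(v);
    // Rebalance so that lower holds the same number of elements as upper, or one more.
    if (lower.size() > upper.size() + 1)  { upper.push(lower.top()); lower.pop(); }
    else if (upper.size() > lower.size()) { lower.push(upper.top()); upper.pop(); }
}

int median() { return lower.top(); } // lower median for even counts

int main()
{
    for (int v : {5, 1, 2, 4, 3}) insert(v);
    std::cout << median() << std::endl; // prints 3
}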

Though, if you need to search for specific values in order to remove elements, you still need some kind of map.

You said that it is slow. Are you iterating from the beginning of the map to the (N/2)'th element every time you need the median? You don't need to. You can keep track of the median by maintaining an iterator pointing to it at all times and a counter of the number of elements less than that one. Every time you insert/remove, compare the new/old element with the median and update both iterator and counter.
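
A sketch of that bookkeeping (hypothetical code, not the questioner's; it uses the parity of size() in place of an explicit counter, and keeps the iterator on the lower median, i.e. the element of zero-based rank (size()-1)/2):

#include <cassert>
#include <iterator>
#include <set>

class median_tracker
{
    std::multiset<int> s;
    std::multiset<int>::iterator mid; // always points at the lower median

public:
    int median() const { assert(!s.empty()); return *mid; }

    void insert(int v) // O(log N): no walk to the middle
    {
        s.insert(v);
        if (s.size() == 1) mid = s.begin();
        else if (v < *mid) { if (s.size() % 2 == 0) --mid; } // inserted left of mid
        else               { if (s.size() % 2 == 1) ++mid; } // inserted at/right of mid
    }

    void erase(int v) // O(log N); v must be present
    {
        if (v < *mid) {
            s.erase(s.find(v));
            if (s.size() % 2 == 1) ++mid;
        } else if (v > *mid) {
            s.erase(s.find(v));
            if (s.size() % 2 == 0) --mid;
        } else {                        // v equals the median: erase mid itself
            auto next = s.erase(mid);   // iterator past the erased element
            if (s.empty()) return;
            mid = (s.size() % 2 == 1) ? next : std::prev(next);
        }
    }
};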

Another way of seeing it is as two multimaps containing half the elements each. One holds the elements less than the median (or equal) and the other holds those greater. The heaps do this more efficiently, but they don't support searches.

If you only need the median a few times you can use the "select" algorithm. It is described in Sedgewick's book. It takes O(n) time on average. It is similar to quick sort but it does not sort completely. It just partitions the array with random pivots until, eventually, it has "selected" the smallest m elements (m = (n+1)/2) on one side. Then you search for the greatest of those m elements, and that is the median.
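
A minimal sketch of that partition-only-one-side idea (a standard quickselect, not Sedgewick's code verbatim; std::nth_element packages the same service in the standard library):

#include <algorithm>
#include <cstdlib>
#include <vector>

// Returns the k-th smallest element (k is 0-based) of a[lo..hi], average O(n).
int quickselect(std::vector<int>& a, int lo, int hi, int k)
{
    while (lo < hi) {
        int pivot = a[lo + std::rand() % (hi - lo + 1)]; // random pivot
        int i = lo, j = hi;
        while (i <= j) {                  // Hoare-style partition
            while (a[i] < pivot) ++i;
            while (a[j] > pivot) --j;
            if (i <= j) std::swap(a[i++], a[j--]);
        }
        if (k <= j)      hi = j;          // recurse into the left part only
        else if (k >= i) lo = i;          // ... or the right part only
        else             return a[k];     // k landed among elements equal to the pivot
    }
    return a[lo];
}

// The (lower) median is then quickselect(a, 0, a.size() - 1, (a.size() - 1) / 2).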

comocomocomocomo

Here is how you could implement that in O(log N) per update:

#include <cassert> // assert
#include <set>     // std::multiset

template <typename T>
class median_set {
public:
  std::multiset<T> below, above;

  // O(log N)
  void rebalance()
  {
    // cast to signed first: size() is unsigned, so the subtraction would wrap
    int diff = static_cast<int>(above.size()) - static_cast<int>(below.size());
    if (diff > 0) {
      below.insert(*above.begin());
      above.erase(above.begin());
    } else if (diff < -1) {
      above.insert(*below.rbegin());
      below.erase(below.find(*below.rbegin()));
    }
  }

public:
  // O(1)
  bool empty() const { return below.empty() && above.empty(); }

  // O(1)
  T const& median() const
  {
    assert(!empty());
    return *below.rbegin();
  }

  // O(log N)
  void insert(T const& value)
  {
    if (!empty() && value > median())
      above.insert(value);
    else
      below.insert(value);
    rebalance();
  }

  // O(log N)
  void erase(T const& value)
  {
    if (value > median())
      above.erase(above.find(value));
    else
      below.erase(below.find(value));
    rebalance();
  }
};

(Work in action with tests)

The idea is the following:

  • Keep track of the values above and below the median in two sets
  • If a new value is added, add it to the corresponding set. Always ensure that the set below has exactly 0 or 1 more elements than the other
  • If a value is removed, remove it from the set and make sure that the condition still holds.
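
A quick usage sketch of the median_set above (values picked arbitrarily):

#include <iostream>

int main()
{
    median_set<int> ms;                     // the class defined above
    ms.insert(5); ms.insert(1); ms.insert(3);
    std::cout << ms.median() << std::endl;  // prints 3
    ms.erase(5);
    std::cout << ms.median() << std::endl;  // prints 1
}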

You can't use priority_queues because they won't let you remove one item.

ipc
  • "You can't use priority_queues because they won't let you remove one item." - fibonacci_heap supports erasing of elements. It's operations are: top - O(1), push - O(1), erase - O(ln(N)). But searching for element to erase (if don't have handle to it) is O(N), so erasing would be O(N). http://www.boost.org/doc/libs/1_53_0/doc/html/heap/data_structures.html – Evgeny Panasyuk Mar 14 '13 at 21:00
Can anyone help me work out the space and time complexity of the following C# program, with details?
// Passing an integer array to find the extreme value in that array
public int extreme(int[] A)
{
    int N = A.Length;
    if (N == 0)
    {
        return -1;
    }
    else
    {
        int average = CalculateAverage(A);
        return FindExtremes(A, average);
    }
}

// Calculate the average of the integer array
private int CalculateAverage(int[] integerArray)
{
    int sum = 0;
    foreach (int value in integerArray)
    {
        sum += value;
    }
    return sum / integerArray.Length; // integer division
}

// Find the index of the element farthest from the average
private int FindExtremes(int[] integerArray, int average)
{
    int Index = -1;
    int maxAbsolute = -1; // largest |value - average| seen so far
    for (int i = 0; i < integerArray.Length; i++)
    {
        int absolute = Math.Abs(integerArray[i] - average);
        if (absolute > maxAbsolute)
        {
            maxAbsolute = absolute;
            Index = i;
        }
    }
    return Index;
}
Shankar Kamble

You are almost certainly better off using a vector. Consider maintaining an auxiliary vector of indexes to be removed between median calculations so you can delete them in batches. New additions can also be put into an auxiliary vector, sorted, and then merged in.
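
For illustration, the insertion side of that batching could look like this sketch (the names are mine, not Steve's): buffer new values unsorted, and fold them into the sorted main vector with std::inplace_merge only when a median is actually requested.

#include <algorithm>
#include <cstddef>
#include <vector>

std::vector<int> data;     // kept sorted between median queries
std::vector<int> pending;  // unsorted new additions

int median()
{
    if (!pending.empty()) {
        std::sort(pending.begin(), pending.end());
        std::size_t old_size = data.size();
        data.insert(data.end(), pending.begin(), pending.end());
        // Both halves are sorted, so merging is O(N) rather than a full re-sort.
        std::inplace_merge(data.begin(), data.begin() + old_size, data.end());
        pending.clear();
    }
    return data[(data.size() - 1) / 2]; // lower median; assumes non-empty data
}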

Steve