37

In an algorithm I have to calculate the 75th percentile of a data set whenever I add a value. Right now I am doing this:

  1. Get value x
  2. Append x at the back of an already sorted array
  3. Swap x down until the array is sorted again
  4. Read the element at position array[array.size * 3/4]

Step 3 is O(n) and the rest is O(1), but this is still quite slow, especially as the array grows. Is there any way to optimize this?
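
For reference, a minimal sketch of these steps in C++ (assuming the values are doubles kept in a std::vector; this is just an illustration, not the original code):

#include <cstddef>
#include <utility>
#include <vector>

std::vector<double> data;  // kept sorted at all times

void add_naive(double x) {
  data.push_back(x);                     // step 2: append at the back
  for (std::size_t i = data.size() - 1;  // step 3: swap down, O(n)
       i > 0 && data[i] < data[i - 1]; --i) {
    std::swap(data[i], data[i - 1]);
  }
}

double percentile75_naive() {
  return data[data.size() * 3 / 4];      // step 4: read by index, O(1)
}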

UPDATE

Thanks Nikita! Since I am using C++, this is the easiest solution to implement. Here is the code:

#include <algorithm>   // std::push_heap, std::pop_heap
#include <functional>  // std::less, std::greater
#include <vector>

template<class T>
class IterativePercentile {
public:
  /// Percentile has to be in the range [0, 1)
  IterativePercentile(double percentile)
    : _percentile(percentile)
  { }

  // Adds a number in O(log(n))
  void add(const T& x) {
    if (_lower.empty() || x <= _lower.front()) {
      _lower.push_back(x);
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
    } else {
      _upper.push_back(x);
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
    }

    // target size of _lower: floor(n * percentile) + 1
    unsigned size_lower = (unsigned)((_lower.size() + _upper.size()) * _percentile) + 1;
    if (_lower.size() > size_lower) {
      // lower to upper
      std::pop_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.push_back(_lower.back());
      std::push_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.pop_back();
    } else if (_lower.size() < size_lower) {
      // upper to lower
      std::pop_heap(_upper.begin(), _upper.end(), std::greater<T>());
      _lower.push_back(_upper.back());
      std::push_heap(_lower.begin(), _lower.end(), std::less<T>());
      _upper.pop_back();
    }            
  }

  /// Access the percentile in O(1)
  const T& get() const {
    return _lower.front();
  }

  void clear() {
    _lower.clear();
    _upper.clear();
  }

private:
  double _percentile;
  std::vector<T> _lower;
  std::vector<T> _upper;
};
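
A minimal usage sketch, assuming the class above is in scope:

#include <iostream>

int main() {
  IterativePercentile<double> p75(0.75);
  const double values[] = {5, 1, 9, 3, 7, 2, 8, 4};
  for (double x : values) {
    p75.add(x);
  }
  // sorted input is [1, 2, 3, 4, 5, 7, 8, 9]; index 8 * 3/4 = 6 holds 8
  std::cout << p75.get() << '\n';  // prints 8
}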
martinus
  • Nice, I had a similar question at an interview recently. Nikita already gave my answer. – Alexandru Sep 17 '10 at 20:49
  • @Alexandru: Similar != Same :-) I believe the heap solution is not required here. It might work for this: http://stackoverflow.com/questions/2213707/finding-an-appropriate-data-structure/, but I think it is a mis-application here. –  Sep 18 '10 at 00:04
  • I think there is undefined behavior in: `if (_lower.empty() || x <= _lower.front()) {` as the order of evaluation is not defined. – davide Oct 18 '17 at 11:16
  • @davide The order of evaluation is well defined, if `_lower.empty()` returns true the right side is not evaluated. – martinus Oct 18 '17 at 14:00
  • @martinus You're right, operators `&&` and `||` are an exception in that they guarantee the order of evaluation. The caveat is that their overloaded counterparts invert or don't guarantee the order of evaluation, depending on whether they are defined as methods, but that's not the case here. I'll reference [this excellent answer on SO](https://stackoverflow.com/a/628554/1012773) on the subject. – davide Oct 18 '17 at 18:54

6 Answers

38

You can do it with two heaps. Not sure if there's a less 'contrived' solution, but this one provides O(log n) time complexity, and heaps are also included in the standard libraries of most programming languages.

The first heap (heap A) contains the smallest 75% of the elements; the other heap (heap B) contains the rest (the largest 25%). The first has its biggest element on top (a max-heap), the second its smallest (a min-heap).

  1. Adding an element

Check whether the new element x is <= max(A). If it is, add it to heap A; otherwise add it to heap B.
Now, if we added x to heap A and it became too big (holds more than 75% of the elements), we need to remove the biggest element from A (O(log n)) and add it to heap B (also O(log n)).
The same applies if heap B becomes too big.

  2. Finding the "0.75 median"

Just take the largest element from A (or the smallest from B). This requires O(log n) or O(1) time, depending on the heap implementation.

edit
As Dolphin noted, we need to specify precisely how big each heap should be for every n (if we want a precise answer). For example, if size(A) = floor(n * 0.75) and size(B) holds the rest, then for every n > 0, array[array.size * 3/4] = min(B).
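
To make the sizing rule concrete: with n = 8, size(A) = floor(8 * 0.75) = 6, so min(B) is the 7th-smallest value; in 0-based indexing that is array[6], which is exactly array[8 * 3/4] from the question.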

Jackson Tale
Nikita Rybak
  • but how do you determine if heap A became too big? – Hari Menon Sep 17 '10 at 19:47
  • @Raze2dust Heap A should hold approx 75% of the elements. If its size goes beyond that, it has become too big. – Nikita Rybak Sep 17 '10 at 19:48
  • @Raze2dust If you mean, "how to get heap size", it's an O(1) operation :) – Nikita Rybak Sep 17 '10 at 19:51
  • I think this idea will work, but I think a few changes are necessary. First, one of the heaps should always have the item you are looking for on it. This way you can figure out what size each heap should be for a given number of elements `heap A=floor(n*.75) and heap B=ceil(n*.25)` (in this case). Next, when you add an item, determine which heap needs to grow. If heap A needs to grow and the item is less than the top of B, add it to A. Otherwise remove the top of B, add it to A, then add the new item to B. (The remove then add would be more efficient as a modify). – Dolphin Sep 17 '10 at 20:17
  • @Dolphin Sorry, I don't completely understand your suggestions. Are you saying that algorithm has mistake? Or it can become simpler or asymptotically faster? – Nikita Rybak Sep 17 '10 at 20:41
  • great idea! to find out where to add a number, I think you can do it this way: given `size` is total size of A+B. When adding a number, calculate `(int)(size * 0.75)` and `(int)((size+1)*0.75)`. If both numbers are the same, grow A, otherwise grow B. – martinus Sep 17 '10 at 20:51
  • @martinus Don't forget, any element in B should be >= any element in A. So, if you choose where to add depending on the size, you'll need afterwards to compare max(A) and min(B) and exchange them if second one is smaller. – Nikita Rybak Sep 17 '10 at 21:12
  • @Nikita - no, just a couple of tweaks. Defining which heap should grow makes the add operation slightly simpler: your add can do 3 O(log n) operations (add, remove, add); my suggestion is two (modify, add) in the worst case. It doesn't really matter which heap you choose, but picking the small heap to always have the item will keep the sizes of the heaps closer, for a (probably insignificant) performance gain. – Dolphin Sep 17 '10 at 21:16
  • Nice solution! Since you only remove max from heap A and min from heap B, maybe you should mention that heap A is a max-heap and heap B is a min-heap. – Eyal Schneider Sep 17 '10 at 22:44
  • @Nikita Ah yeah, now I know why they say sleep is necessary.. :D – Hari Menon Sep 18 '10 at 14:28
  • @NikitaRybak This is IMHO the best solution. However, note it is O(N), not O(lg N), since you will be paying at least O(1) for every element. Basically this is the minimum you can do, since you have to at least see the value of every element, and that is O(N) – ntg Jul 17 '20 at 14:17
17

A simple Order Statistics Tree is enough for this.

A balanced version of this tree supports O(log n) insert/delete and access by rank. So you not only get the 75th percentile, but also the 66th or 50th or whatever you need, without having to change your code.

If you access the 75th percentile frequently but insert less frequently, you can always cache the 75th-percentile element during an insert/delete operation.

Most standard implementations (like Java's TreeMap) are order statistic trees.
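
One way to get rank access in C++ without writing the tree yourself is GCC's policy-based data structures (a non-standard libstdc++ extension); a minimal sketch, with duplicates disambiguated by an insertion counter since the tree has set semantics:

#include <ext/pb_ds/assoc_container.hpp>
#include <ext/pb_ds/tree_policy.hpp>
#include <cstddef>
#include <functional>
#include <utility>

typedef __gnu_pbds::tree<
    std::pair<double, unsigned long>,          // (value, insertion id) to allow duplicates
    __gnu_pbds::null_type,
    std::less< std::pair<double, unsigned long> >,
    __gnu_pbds::rb_tree_tag,
    __gnu_pbds::tree_order_statistics_node_update> ordered_set;

ordered_set values;
unsigned long insert_id = 0;

void add(double x) {                           // O(log n)
  values.insert(std::make_pair(x, insert_id++));
}

double percentile(double p) {                  // O(log n), any p in [0, 1)
  std::size_t rank = (std::size_t)(values.size() * p);
  return values.find_by_order(rank)->first;    // rank-th smallest, 0-indexed
}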

Tautvydas
  • +1 for a useful technique. But you have a mistake: Java's TreeSet (or Map) won't give you the tools necessary to iterate from the tree root down to the leaves. IIRC, the STL version won't either. You'll have to write your own balanced tree or hack someone else's code. Hardly enjoyable. – Nikita Rybak Sep 18 '10 at 00:13
  • +1 - But you can't index a Java `TreeSet` by rank. You _can_ use Java's `TreeSet` if the values will not repeat; you just need to keep track of your current 75th percentile and the number of items to the left and to the right. When you add something, place it into the set and update the left/right numbers. If you now have too many on the right, use `higher` to get the next one; if too many on the left, use `lower` to get the previous; if you're okay, don't do anything. If the values repeat, you'll have to create a map from key to some collection (list?), and then a similar trick works. – Rex Kerr Sep 18 '10 at 02:05
  • @Nikita: I believe TreeMap has it! Look at the comments to this answer: http://stackoverflow.com/questions/3071497/list-or-container-o1-ish-insertion-deletion-performance-with-array-semantics/3071566#3071566. @Rex, I was talking of TreeMap. Of course I haven't used Java in a while. –  Sep 18 '10 at 02:36
  • But Rex's idea should work (although it's not terribly simple to implement) – Nikita Rybak Sep 18 '10 at 02:46
  • @Nikita: I am not claiming that you _have_ to traverse the tree yourself. I am claiming that the data structure provides API for accessing/inserting/deleting by position. Anyway I am not so sure about TreeMap now... –  Sep 18 '10 at 06:06
  • I've tried it with a tree, but the heap implementation is several times faster for my use case. – martinus Sep 22 '10 at 07:17
  • @martinus: Did you try caching? Anyway, glad this forum worked out for you :-) –  Sep 22 '10 at 13:41
  • Caching is of no use for me, since after each insert I call one get() operation. I think the heap solution is faster because it can use two arrays as the backend. – martinus Sep 22 '10 at 14:36
  • @Martinus: I see. If your 75% is fixed, I agree the heap will be faster: you have partitioned it based on the 75% element. So insertions will be faster etc. –  Sep 22 '10 at 16:12
3

If you can live with an approximate answer, you can use a histogram instead of keeping the entire set of values in memory.

For each new value, add it to the appropriate bin. Calculate the 75th percentile by traversing the bins and summing counts until 75% of the population size is reached. The percentile value lies between the low and high bounds of the bin you stopped at.

This gives O(B) query complexity, where B is the number of bins, i.e. range_size / bin_size (use a bin_size appropriate to your use case).

I have implemented this logic in a JVM library: https://github.com/IBM/HBPE which you can use as a reference.
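
A minimal sketch of this idea in C++, assuming the value range is known up front (the class name and parameters are illustrative, not taken from the HBPE library):

#include <cstddef>
#include <vector>

class HistogramPercentile {
public:
  // Values are assumed to fall in [min_value, max_value); bin_count trades memory for accuracy.
  HistogramPercentile(double min_value, double max_value, std::size_t bin_count)
    : _min(min_value), _width((max_value - min_value) / bin_count),
      _bins(bin_count, 0), _count(0) { }

  void add(double x) {                      // O(1)
    std::size_t i = (std::size_t)((x - _min) / _width);
    if (i >= _bins.size()) i = _bins.size() - 1;  // clamp values at/above max_value
    ++_bins[i];
    ++_count;
  }

  double percentile(double p) const {       // O(B), B = number of bins
    std::size_t target = (std::size_t)(_count * p);
    std::size_t sum = 0;
    for (std::size_t i = 0; i < _bins.size(); ++i) {
      sum += _bins[i];
      if (sum > target)
        return _min + (i + 0.5) * _width;   // midpoint of the bin holding the target rank
    }
    return _min + _bins.size() * _width;    // reached only if the histogram is empty
  }

private:
  double _min, _width;
  std::vector<std::size_t> _bins;
  std::size_t _count;
};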

dux2
-2

If you have a known, bounded set of integer values, the following will be very fast:

Create a large array of integers (even bytes will work) with the number of elements equal to the maximum value of your data. For example, if the maximum value of t is 100,000, create an array

int[] index = new int[100000]; // 400kb

Now iterate over the entire set of values:

for (int t : set_of_values) {
  index[t]++;
}

// You could wrap this in a try/catch for ArrayIndexOutOfBoundsException, just in case :)

Now calculate the percentile as:

int sum = 0, i = 0;
while (sum < 0.9 * set_of_values.length) {  // 0.9 = 90th percentile; use 0.75 for the 75th
  sum += index[i++];
}

return i - 1;  // the last bin added is the one that pushed the sum past the threshold

You can also consider using a TreeMap instead of an array if the values don't conform to these restrictions.

  • This makes insertion O(1), but it makes finding the 75th percentile element O(M), where M is the highest value. M is probably much larger than N. (Also, note that the OP was using double-precision float values, so there's no hope of representing them with a bitmap (or repeat-count array) of reasonable size). So the overall time complexity is O(NM), for the list of 75th percentiles from every partial list. This would be interesting if the range of possible values was quite small, but not helpful here. I wouldn't call it "very fast", though, compared to the two-heap trick. – Peter Cordes Oct 26 '15 at 11:52
  • I don't get the downvotes for this answer. Even if the values are float, if their distribution is known, careful binning can yield very accurate results. If you can get M low enough, it can be really fast in comparison to O(n log(n)), especially considering that the operations are really simple and fast (float adding, indexing). Also, since adding a number is O(1), if you don't need to get the updated value of the percentile every time you add a number, you save a lot of log(n) lookups on the heap. Since the OP was looking for speed, this is worth considering. – Pepe Mandioca Dec 24 '19 at 18:39
-2

Here is a JavaScript solution. Copy-paste it into the browser console and it works. $scores contains the list of scores, and $percentile gives the n-th percentile of the list. So the 75th percentile is 76.8 and the 90th percentile is 87.9.

function get_percentile($percentile, $array) {
    // sort numerically; the default sort() compares elements as strings
    $array = $array.sort(function (a, b) { return a - b; });
    $index = ($percentile / 100) * $array.length;
    if (Math.floor($index) === $index) {
        $result = ($array[$index - 1] + $array[$index]) / 2;
    }
    else {
        $result = $array[Math.floor($index)];
    }
    return $result;
}

$scores = [22.3, 32.4, 12.1, 54.6, 76.8, 87.3, 54.6, 45.5, 87.9];

get_percentile(75, $scores);
get_percentile(90, $scores);
sapy
-2

You can use binary search to find the correct position in O(log n). However, shifting the array elements up is still O(n).
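
A sketch of this in C++ with std::upper_bound: the search is logarithmic, but vector::insert still shifts the tail of the array linearly.

#include <algorithm>
#include <vector>

std::vector<double> data;  // kept sorted

void add(double x) {
  // O(log n) to find the position, O(n) to shift elements during insert
  data.insert(std::upper_bound(data.begin(), data.end(), x), x);
}

double percentile75() {
  return data[data.size() * 3 / 4];
}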

Matthew Flaschen