
I have 100 integers in my database. I sort them in ascending order. Right now, for the 99th percentile, I take the 99th number after sorting.

After a given time t, a new number comes into the database and an older number gets discarded. The current code just takes the 100 integers and sorts them all over again.

Since there are 99 numbers shared by the set of the original 100 integers and the set of 100 integers after time t, is there a more efficient way of calculating the 99th percentile, 95th percentile, 90th percentile, etc.?

PS: All of this is done in a MySQL database.
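
For reference, the current full re-sort amounts to something like this (a sketch only; `numbers` and `val` are placeholder names for the table and column):

```sql
-- sort all 100 values and take the 99th one as the 99th percentile
SELECT val
FROM numbers
ORDER BY val ASC
LIMIT 1 OFFSET 98;  -- row 99 of 100 (OFFSET is 0-based)
```

For the 95th and 90th percentiles I do the same with `OFFSET 94` and `OFFSET 89`.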

  • If you are using a database, i.e. MySQL, I do not think you will not be able to increase efficiency beyond using an index. Others may have better ideas. If you code your own sorted data structure you could optimize inserts and deletes using a binary search lookup. – mba12 Sep 08 '16 at 21:44
  • Sorry I forgot to mention everything is done under MySQL database. – Laughing Day Sep 08 '16 at 21:46
  • 1
    (@mba12: `I do not think you will not be able to …` - please try and avoid multiple negations.) – greybeard Sep 08 '16 at 21:49
  • @greybeard Thanks for the catch, that was an editing mistake, meant to say "I do not think you will be able to..." – mba12 Sep 09 '16 at 15:45

3 Answers


If your data are randomly distributed, you could try guessing the position by assuming a linear (roughly uniform) distribution.

guessPosition = (newnumber - min) * 100 / (max - min)

And then do a gallop search outward from that point.

And when the position is found, insert the new number there.
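
In MySQL terms the guess could be computed like this (a sketch only; `numbers`/`val` are assumed names, and the gallop search itself would need a loop, e.g. in a stored procedure or in application code):

```sql
-- guess the rank of @newnumber among N = 100 values, assuming the values
-- are spread roughly evenly between min and max (assumes mx > mn)
SET @newnumber := 42;  -- example incoming value
SELECT FLOOR((@newnumber - mn) * 100 / (mx - mn)) AS guess_position
FROM (SELECT MIN(val) AS mn, MAX(val) AS mx FROM numbers) AS bounds;
-- a gallop search would then probe positions guess ± 1, ± 2, ± 4, ...
-- until the correct insertion point is bracketed
```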

Surt

Let's call N the size of your array A (here N = 100), and say you're looking for the K-th smallest element (interleaved with modification requests).

The easiest solution is probably a kind of modified insertion sort: you keep a (sorted) array of the N-K+1 largest elements (let's call it B).

  • Discard an element e: walk through B (e.g. while B[i] < e) (*). If B[i] = e, shift all elements at indices < i one position to the right.
  • Insert an element e: find the lowest index i such that B[i] > e. Shift all elements at indices >= i one position to the right and set B[i] := e.
  • Get the K-th smallest element: return B[0].

Time complexity: O(N-K) per request.

(*) Actually you could speed up the search step using binary search, but it won't change the overall time complexity.

If N-K is very large, it would be interesting to use binary trees instead (with O(log(N-K)) time complexity per request). But given the actual size of your data sets (and your programming language), it won't pay off.
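
Emulating the array B with a small helper table in MySQL (in the spirit of the temporary-table idea mentioned in the comments), the three operations might look roughly like this; every name here is a placeholder and `@e` stands for the element being discarded or inserted:

```sql
-- B = the N-K+1 largest values (just 2 rows for the 99th percentile of 100);
-- it must be seeded once from the full data set
CREATE TABLE b (val INT NOT NULL, INDEX (val));

-- "Get the K-th smallest element": B[0] is simply the minimum of the table
SELECT MIN(val) FROM b;

-- "Discard an element e": remove one matching row (a no-op when e is not in B)
DELETE FROM b WHERE val = @e LIMIT 1;

-- "Insert an element e": only add it when it belongs among the largest values
-- (the derived table works around MySQL's restriction on selecting from the
-- table being inserted into)
INSERT INTO b (val)
SELECT @e FROM (SELECT MIN(val) AS m FROM b) AS t WHERE @e >= t.m;
```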

md5
  • I need to use MySQL though – Laughing Day Sep 09 '16 at 19:37
  • @LaughingDay: I don't know MySQL, but it looks like there is a concept of [loops](https://dev.mysql.com/doc/refman/5.7/en/loop.html) and you can emulate arrays using [temporary tables](http://stackoverflow.com/questions/12176709/how-can-i-simulate-an-array-variable-in-mysql). The first solution is just ~10 lines long in most popular programming languages, it shouldn't be so hard to code in MySQL either. – md5 Sep 10 '16 at 16:16
  • What if the dataset is huge and values are added and removed frequently? Then keeping a temporary table each time would impact performance and take up a lot of memory, no? – Laughing Day Sep 12 '16 at 20:36
  • @LaughingDay: Theoretically the complexity is quite low. The table is just `N-K+1` elements long (e.g. in your case its size is 2!). As far as time is concerned, it really depends on the implementation of the DB... (though once again, in your case you will just loop over 2 elements). The only solution is to benchmark it (and compare with the sorting solution). – md5 Sep 12 '16 at 21:01

So: insert into the normal table, and also add a trigger that inserts into an extra, sorted (indexed) table. Every time you insert into the extra table, add the new element; with the index, finding the smallest (or largest) element should be fast. Drop that element. Then either re-compute the percentile directly if the number of items (K) is small, or keep a running sum of the elements stored somewhere: subtract the discarded value and add the new one. That way you have the sum without iterating the whole list, and the total number of elements is also quick to get from the DB. This should be O(log(N-K))-ish time. I think this was a Google interview question (minus the DB part). A sketch of the trigger part follows below.
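
A minimal sketch of that trigger, assuming a main table `numbers(val)` and an extra indexed table `numbers_sorted` (both names made up for illustration):

```sql
CREATE TABLE numbers_sorted (
  val INT NOT NULL,
  INDEX (val)              -- the index keeps min/max and rank lookups cheap
);

DELIMITER //
CREATE TRIGGER numbers_ai AFTER INSERT ON numbers
FOR EACH ROW
BEGIN
  INSERT INTO numbers_sorted (val) VALUES (NEW.val);
END//
DELIMITER ;

-- finding the smallest element (to drop) and the percentile stays fast:
SELECT val FROM numbers_sorted ORDER BY val ASC LIMIT 1;            -- smallest
SELECT val FROM numbers_sorted ORDER BY val ASC LIMIT 1 OFFSET 98;  -- 99th of 100
```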

gleenn