Calculating percentile rank in MySQL

Question

I have a very big table of measurement data in MySQL and I need to compute the percentile rank for each and every one of these values. Oracle appears to have a function called percent_rank, but I can't find anything similar for MySQL. Sure, I could just brute-force it in Python, which I use anyway to populate the table, but I suspect that would be quite inefficient because one sample might have 200,000 observations.

Can you please explain exactly what you mean by percentile rank? — Assaf Lavie, Jun 29 '09 at 07:48
I made a Mysql function working for any percentile : http://stackoverflow.com/a/40266115/1662956 — dartaloufe, Oct 26 '16 at 15:25

score 20 · Answer 1 · answered Oct 25 '11 at 03:18

Here's a different approach that doesn't require a join. In my case (a table with 15,000+ rows), it runs in about 3 seconds. (The JOIN method takes an order of magnitude longer.)

In the sample, assume that measure is the column on which you're calculating the percent rank, and id is just a row identifier (not required):

SELECT
    id,
    @prev := @curr as prev,
    @curr := measure as curr,
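    -- rank = number of rows seen so far with a strictly greater measure
    -- (rank is a reserved word as of MySQL 8.0, hence the backticks)
    @rank := IF(@prev > @curr, @rank+@ties, @rank) AS `rank`,
    -- ties = length of the current run of equal measures
    @ties := IF(@prev = @curr, @ties+1, 1) AS ties,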
    (1-@rank/@total) as percentrank
FROM
    mytable,
    (SELECT
        @curr := null,
        @prev := null,
        @rank := 0,
        @ties := 1,
        @total := count(*) from mytable where measure is not null
    ) b
WHERE
    measure is not null
ORDER BY
    measure DESC

Credit for this method goes to Shlomi Noach. He writes about it in detail here:

http://code.openark.org/blog/mysql/sql-ranking-without-self-join

I've tested this in MySQL and it works great; no idea about Oracle, SQLServer, etc.
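
To make the bookkeeping in the query above easier to follow, here is a minimal Python sketch of the same rank/ties logic. The function name `percent_ranks` and the sample data are made up for illustration; it assumes the values fit in memory and is not meant as a replacement for the SQL.

```python
def percent_ranks(measures):
    """Walk the values in descending order, like ORDER BY measure DESC."""
    # Like the WHERE clause in the query, skip NULLs (None in Python).
    ordered = sorted((m for m in measures if m is not None), reverse=True)
    total = len(ordered)
    prev = None
    rank = 0  # rows seen so far with a strictly greater value
    ties = 1  # length of the current run of equal values
    result = []
    for curr in ordered:
        if prev is not None and prev > curr:
            # Value dropped: every row of the previous run now outranks us.
            rank += ties
            ties = 1
        elif prev is not None and prev == curr:
            ties += 1
        result.append((curr, 1 - rank / total))
        prev = curr
    return result


print(percent_ranks([10, 20, 20, 30, None]))
# [(30, 1.0), (20, 0.75), (20, 0.75), (10, 0.25)]
```

Under this convention the result for each value is the fraction of rows that are less than or equal to it.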

Unfortunately this depends on the order of evaluation for user variables, which is undefined behavior. The first comment in that link quotes the MySQL manual: "The order of evaluation for user variables is undefined and may change based on the elements contained within a given query....The general rule is never to assign a value to a user variable in one part of a statement and use the same variable in some other part of the same statement. You might get the results you expect, but this is not guaranteed." Reference: http://dev.mysql.com/doc/refman/5.1/en/user-variables.html — rep, Jan 02 '15 at 20:12

score 6 · Answer 2 · edited Apr 20 '15 at 17:19

SELECT
    c.id, c.score, ROUND(((@rank - c.`rank`) / @rank) * 100, 2) AS percentile_rank
FROM
    (SELECT
        *,
        @prev:=@curr,
        @curr:=a.score,
        -- dense rank: ties share a rank, each new score advances it by 1
        -- (rank is a reserved word as of MySQL 8.0, hence the backticks)
        @rank:=IF(@prev = @curr, @rank, @rank + 1) AS `rank`
    FROM
        (SELECT id, score FROM mytable) AS a,
        (SELECT @curr:= null, @prev:= null, @rank:= 0) AS b
    ORDER BY score DESC) AS c;

edited Apr 20 '15 at 17:19

answered Apr 20 '15 at 07:01

Conor

score 4 · Answer 3 · answered Jun 29 '09 at 07:58

There is no easy way to do this. See http://rpbouman.blogspot.com/2008/07/calculating-nth-percentile-in-mysql.html

Nir Levy

What I'm looking for is actually the inverse of that i.e. given a number it should tell me its rank. I'm somewhat confident this would be easier in Oracle but unfortunately that isn't a possibility. – lhahne Jun 29 '09 at 09:47

score 3 · Accepted Answer · answered Aug 31 '09 at 06:09

This is a relatively ugly answer, and I feel guilty saying it. That said, it might help you with your issue.

One way to determine the percentage would be to count all of the rows, and count the number of rows that are greater than the number you provided. You can calculate either greater or less than and take the inverse as necessary.

Create an index on your number column. total = select count(*); less_equal = select count(*) where value <= indexed_number;

The percentage would be something like: less_equal / total or (total - less_equal)/total
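
As a rough illustration of the counting idea, here is a small Python sketch. A sorted list stands in for the index, and `fraction_less_equal` is a hypothetical helper name, not anything MySQL provides.

```python
from bisect import bisect_right


def fraction_less_equal(sorted_values, x):
    """Return the share of values <= x; sorted_values must be ascending."""
    total = len(sorted_values)
    if total == 0:
        raise ValueError("no values to rank against")
    # bisect_right gives the number of elements <= x in a sorted list,
    # which plays the role of: select count(*) where value <= x
    less_equal = bisect_right(sorted_values, x)
    return less_equal / total


values = sorted([10, 20, 20, 30])
print(fraction_less_equal(values, 20))  # 0.75
```

The lookup is logarithmic once the data is sorted, which is the same reason the index matters for the SQL version.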

Make sure that both queries use the index that you created. If they do not, tweak them until they do. The EXPLAIN output should show "Using index" in the right-hand (Extra) column. The select count(*) should be using the index for InnoDB and show something like const for MyISAM, since MyISAM knows this value at any time without having to calculate it.

If you needed to have the percentage stored in the database, you can use the setup from above for performance and then calculate the value for each row by using the second query as an inner select. The first query's value can be set as a constant.

Does this help?

Jacob

I actually tried that a few weeks ago and it was incredibly slow so I ended up calculating percentiles in python and putting the value in database. — lhahne, Sep 01 '09 at 06:29
You tried to use the select count(*) and select count(*) where value <= yourvalue? Did you confirm that both of them were being handled by an index that only had the columns you needed? If the solution had to touch the data rows at all, I would expect it to be one or two orders of magnitude slower. If the indexes included more than the columns needed, or the memory configuration of MySQL was not set up right, I would expect it to be very slow. If everything was set up correctly, this should have been fast. Roughly how much time is "incredibly slow"? Depending on the order of magnitude of the expected response, my answer could be unwholesomely slow. — TheJacobTaylor, Sep 01 '09 at 20:39
@TheJacobTaylor Correct answer but short on code. If you put a functional 'select distinct' type query up, you get my +1. Also, if you can fix this, you get a nice shiny +1 and check! ;)) http://stackoverflow.com/questions/13689434/update-all-rows-with-countdistinct-only-updates-first-row-the-rest-0 — , Dec 11 '12 at 18:35

score 3 · Answer 5 · answered Jan 28 '19 at 09:36

MySQL 8 finally introduced window functions, and among them, the PERCENT_RANK() function you were looking for. So, just write:

SELECT col, percent_rank() OVER (ORDER BY col)
FROM t
ORDER BY col
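
For reference, PERCENT_RANK() is defined as (rank - 1) / (rows - 1), where rank is the usual RANK() with gaps for ties. The Python sketch below reproduces that formula on an in-memory list; the function name and sample data are made up for illustration.

```python
from bisect import bisect_left


def percent_rank(values):
    """Return (value, percent_rank) pairs in ascending order."""
    ordered = sorted(values)
    n = len(ordered)
    result = []
    for v in ordered:
        # RANK() = 1 + number of rows with a strictly smaller value.
        rank = bisect_left(ordered, v) + 1
        # A single-row partition has percent rank 0.
        pr = 0.0 if n == 1 else (rank - 1) / (n - 1)
        result.append((v, pr))
    return result


for value, pr in percent_rank([10, 20, 20, 30]):
    print(value, pr)
```

With this definition the lowest value always gets 0, and tied values share the same result.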

Your question mentions "percentiles", which are a slightly different thing. For completeness' sake, there are PERCENTILE_DISC and PERCENTILE_CONT inverse distribution functions in the SQL standard and in some RDBMS (Oracle, PostgreSQL, SQL Server, Teradata), but not in MySQL. With MySQL 8 and window functions, however, you can emulate PERCENTILE_DISC using the PERCENT_RANK and FIRST_VALUE window functions.
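
To illustrate the difference in direction (a fraction goes in, a value comes out), here is a Python sketch of a discrete percentile. It follows the usual cumulative-distribution definition of PERCENTILE_DISC (the first value whose position in the ordering reaches the requested fraction); this is an assumption for illustration, not the exact window-function emulation mentioned above, and `percentile_disc` is a hypothetical helper.

```python
def percentile_disc(values, fraction):
    """Return the first value whose cumulative share is >= fraction."""
    if not 0 <= fraction <= 1:
        raise ValueError("fraction must be between 0 and 1")
    ordered = sorted(values)
    n = len(ordered)
    if n == 0:
        raise ValueError("no values")
    for position, value in enumerate(ordered, start=1):
        # position / n is the cumulative distribution at this row.
        if position / n >= fraction:
            return value


print(percentile_disc([10, 20, 30, 40, 50], 0.5))  # 30
```

The result is always one of the input values, unlike PERCENTILE_CONT, which interpolates.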

score 2 · Answer 6 · edited Apr 22 '13 at 15:07

If you're combining your SQL with a procedural language like PHP, you can do the following. This example breaks the excess block times of flights into an airport down into percentiles. It uses the LIMIT x,y clause in MySQL in combination with ORDER BY. Not very pretty, but it does the job (sorry, I struggled with the formatting):

$startDt = "2011-01-01";
$endDt = "2011-02-28";
$arrPort= 'JFK';

$strSQL = "SELECT COUNT(*) as TotFlights FROM FIDS where depdt >= '$startDt' And depdt <= '$endDt' and ArrPort='$arrPort'";
if (!($queryResult = mysql_query($strSQL, $con)) ) {
    echo $strSQL . " FAILED\n"; echo mysql_error();
    exit(0);
}
$totFlights=0;
while($fltRow=mysql_fetch_array($queryResult)) {
    echo "Total Flights into " . $arrPort . " = " . $fltRow['TotFlights'];
    $totFlights = $fltRow['TotFlights'];

    /* 1906 flights. Percentile 90 = int(0.9 * 1906). */
    for ($x = 1; $x<=10; $x++) {
        $pctlPosn = $totFlights - intval( ($x/10) * $totFlights);
        echo "PCTL POSN for " . $x * 10 . " IS " . $pctlPosn . "\t";
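        /* Apply the same date filter as the COUNT(*) query so the offset matches the row set. */
        $pctlSQL = "SELECT  (ablk-sblk) as ExcessBlk from FIDS where depdt >= '$startDt' And depdt <= '$endDt' and ArrPort='" . $arrPort . "' order by ExcessBlk DESC limit " . $pctlPosn . ",1;";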
        if (!($query2Result = mysql_query($pctlSQL, $con)) ) {
            echo $pctlSQL  . " FAILED\n";
            echo mysql_error();
            exit(0);
        }
        while ($pctlRow = mysql_fetch_array($query2Result)) {
            echo "Excess Block is :" . $pctlRow['ExcessBlk'] . "\n";
        }
    }
}

score 0 · Answer 7 · answered Aug 21 '09 at 08:39

To get the rank, I'd say you need to (left) outer join the table to itself, something like:

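-- mytable is a placeholder table name (table itself is a reserved word)
-- count() skips the NULLs produced by the left join, so the lowest value gets 0
select t1.name, t1.value, count(distinct t2.value) as lower_values
from mytable t1
left join mytable t2
on t1.value > t2.value
group by t1.name, t1.value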

For each row, this counts how many (if any) distinct values in the same table are lower.

Note that I'm more familiar with SQL Server, so the syntax might not be right. Also, the distinct may not have the right behaviour for what you want to achieve. But that's the general idea.
Then, to get the real percentile rank, you first need to store the number of values in a variable (or the number of distinct values, depending on the convention you choose) and compute the percentile rank from the rank given above.

score 0 · Answer 8 · answered Nov 15 '18 at 14:17

Suppose we have a sales table like:

user_id,units

then the following query will give the percentile of each user:

select a.user_id, a.units,
(sum(case when a.units >= b.units then 1 else 0 end)*100)/count(1) percentile
from sales a join sales b
group by a.user_id, a.units;

Note that this performs a cross join, so it has O(n^2) complexity and can be considered an unoptimized solution, but it is simple given that this MySQL version has no built-in function for it.
