Select count of max items, get rank and percentile

Question

I have a table with multiple entries per person_id column. I'm storing a score, a category_id, and a created column as well. So every time the person completes the indicated category I store a single record for them.

Now I'm trying to write a query that says: Just using the most recently created score for each person, find out how many people scored worse than I did for a specific category. I'm basically doing a percentile calculation here. So to get the total number of scores I'm doing:

select count(distinct person_id) from performances where category_id = 7;

I'm not sure how to write the second query though, that finds out how many people did worse than me. Is this somewhere I'd use that "OVER PARTITION" type windowing function?

A question like this needs a *clear definition* of the problem, your version of Postgres, and a table definition (`\d tbl` in psql), so we can see data types, what is unique, what can be NULL, etc. Example data and expected outcome also go a long way. It's not that bad as questions go on SO lately, but it is still unclear and ambiguous. — Erwin Brandstetter, Aug 09 '14 at 01:07

score 3 · Answer 1 · edited May 23 '17 at 12:29

What you actually asked

Just using the most recently created score for each person

... translates to:

SELECT DISTINCT ON (person_id) *
FROM   performances
ORDER  BY person_id, created DESC;

Do not add a WHERE condition here (yet) or you get different (incorrect) results. Details for DISTINCT ON:

Select first row in each GROUP BY group?
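
To see why the WHERE condition matters, here is a small self-contained sketch. The three sample rows are invented for illustration; only the table and column names come from the question.

```sql
-- Hypothetical sample data standing in for the real performances table.
WITH performances(person_id, category_id, score, created) AS (
   VALUES (1, 7, 50, date '2014-08-01')
        , (1, 8, 70, date '2014-08-05')
        , (2, 7, 60, date '2014-08-03')
   )
SELECT DISTINCT ON (person_id) *
FROM   performances
-- WHERE  category_id = 7   -- enabling this picks person 1's older category 7 row instead
ORDER  BY person_id, created DESC;
-- As written: person 1 -> (category 8, score 70), person 2 -> (category 7, score 60)
```

As written, the latest row for person 1 belongs to category 8, so person 1 would not show up in a ranking for category 7 built on top of this result. With the WHERE line enabled, the query returns the latest row per person within category 7, which answers a different question.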

find out how many people scored worse than I did for a specific category.

... translates to:

SELECT *
     , dense_rank() OVER w AS worse_than_me
     , ntile(100)   OVER w AS percentile
FROM  (
   SELECT DISTINCT ON (person_id) *
   FROM   performances
   ORDER  BY person_id, created DESC
  ) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);

Assuming "worse" means a lower score.
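Mind the difference between the two ranking functions when scores can tie: rank() counts rows, so rank() - 1 answers "How many people scored lower?", whereas dense_rank() counts distinct values, so dense_rank() - 1 answers "How many distinct lower scores are there?". The query above uses dense_rank(); swap in rank() if you want a head count.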
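
To see how rank() and dense_rank() differ once scores tie, here is a small self-contained sketch. The four rows and the names person, rnk and dense_rnk are made up for illustration.

```sql
-- Hypothetical data: b and c share the same score.
SELECT person, score
     , rank()       OVER w AS rnk         -- 1 + number of rows with a lower score
     , dense_rank() OVER w AS dense_rnk   -- 1 + number of distinct lower scores
FROM  (VALUES ('a', 10), ('b', 20), ('c', 20), ('d', 30)) t(person, score)
WINDOW w AS (ORDER BY score)
ORDER  BY score, person;
-- person | score | rnk | dense_rnk
-- a      |    10 |   1 |         1
-- b      |    20 |   2 |         2
-- c      |    20 |   2 |         2
-- d      |    30 |   4 |         3
```

For person d, rnk - 1 = 3 is the number of people with a lower score, while dense_rnk - 1 = 2 is the number of distinct lower scores.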

ntile(100) over the same window definition gives you the ready percentile as integer, 100 meaning in the top 1 %, 99 meaning in the 2nd best % etc.

However, ntile() returns, per documentation:

integer ranging from 1 to the argument value, dividing the partition as equally as possible

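That means, if you have fewer than 100 rows in your partition (like you commented), multiply by 100.0 / count(*) to scale the number. Take that count(*) over the whole partition, i.e. over a window without ORDER BY; with ORDER BY in the window definition, the default frame turns it into a running count. A "percentile" is not the most useful statistic for just a handful of rows in a set; it's typically used on big sets.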
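
If scaling ntile() feels awkward, one alternative is to compute the share of people with a lower score directly. The following is only a sketch under assumptions: the sample rows are invented, and the output column names (scored_lower, total, pct_scored_lower) are hypothetical.

```sql
-- Hypothetical sample data standing in for the real performances table.
WITH performances(person_id, category_id, score, created) AS (
   VALUES (1, 7, 50, date '2014-08-01')
        , (1, 7, 70, date '2014-08-05')
        , (2, 7, 60, date '2014-08-03')
        , (3, 7, 90, date '2014-08-04')
        , (4, 7, 70, date '2014-08-02')
   )
SELECT person_id, category_id, score
     , rank() OVER w - 1                         AS scored_lower  -- rows with a strictly lower score
     , count(*) OVER (PARTITION BY category_id)  AS total         -- no ORDER BY: whole partition
     , round(100.0 * (rank() OVER w - 1)
             / count(*) OVER (PARTITION BY category_id), 1) AS pct_scored_lower
FROM  (
   SELECT DISTINCT ON (person_id) *
   FROM   performances
   ORDER  BY person_id, created DESC
   ) p
WINDOW w AS (PARTITION BY category_id ORDER BY score)
ORDER  BY score, person_id;
-- person_id | score | scored_lower | total | pct_scored_lower
--         2 |    60 |            0 |     4 |              0.0
--         1 |    70 |            1 |     4 |             25.0
--         4 |    70 |            1 |     4 |             25.0
--         3 |    90 |            3 |     4 |             75.0
```

The total is taken over a separate window without ORDER BY, because count(*) OVER w would only count rows up to and including the current row's peers.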

What you did not ask I

But quite possibly meant to ask:

"How does each person rank in the category (s)he finished last among all other results in that category?"

Assuming unique entries for (person_id, category_id), or you also have to define how to deal with multiple results per person in the same category (including self).

SELECT *
FROM  (
   SELECT DISTINCT ON (person_id) *
   FROM   performances
   ORDER  BY person_id, created DESC
   ) pers
JOIN (
   SELECT person_id, category_id
        , dense_rank() OVER w AS worse_than_me
        , ntile(100)   OVER w AS percentile
   FROM   performances
   WINDOW w AS (PARTITION BY category_id ORDER BY score)
   ) rnk USING (person_id, category_id);

In the subquery pers we distill the last entry per person (the one of interest).
In the subquery rnk we get ranking and percentile compared to all other entries.
JOIN with the USING clause, and you get a ready SELECT list without duplicate columns.

What you did not ask II

but would also make more sense if there can be multiple entries per (person_id, category_id):

"Get the rank for the latest score of each person in each category compared to all other latest personal scores in the same category."

SELECT *
     , dense_rank() OVER w AS worse_than_me
     , ntile(100)   OVER w AS percentile
FROM  (
   SELECT DISTINCT ON (person_id, category_id) *
   FROM   performances
   ORDER  BY person_id, category_id, created DESC
   ) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);

Unclear / ambiguous questions lead to arbitrary results. The first step to a solution is to define the task clearly.

I'm not sure what that ntile is really giving me. A percentile is calculated as the total number of scores less than yours divided by the total candidates times 100. The query you showed is just giving me a number from 1 to 5 if I have 5 candidates. — Gargoyle, Aug 10 '14 at 00:23
@Gargoyle: I added some explanation for that. For just 5 candidates, a "percentile" doesn't seem like a useful statistic ... — Erwin Brandstetter, Aug 10 '14 at 23:45
Thanks, Erwin. I do have way more than 5 data points...I was just using test data to understand your example easier. — Gargoyle, Aug 11 '14 at 01:21

Clodoaldo Neto · Accepted Answer · 2014-08-08T19:22:45.563

select
    person_id,
    count(*) over() as total_person,
    rank() over(order by score desc) as score_rank
from (
    select distinct on (person_id) *
    from performances
    where category_id = 7
    order by person_id, created desc
) s;

Check rank, dense_rank, percent_rank, ntile, and cume_dist:

http://www.postgresql.org/docs/current/static/functions-window.html
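
As a rough illustration of two of those functions, here is a self-contained sketch of percent_rank() and cume_dist(). The data and column aliases are invented, and the window orders by ascending score, unlike the descending rank in the query above.

```sql
-- Hypothetical data: b and c share the same score.
SELECT person, score
     , percent_rank() OVER w AS pct_rank   -- (rank - 1) / (rows in partition - 1)
     , cume_dist()    OVER w AS cume       -- rows with score <= current / rows in partition
FROM  (VALUES ('a', 10), ('b', 20), ('c', 20), ('d', 30)) t(person, score)
WINDOW w AS (ORDER BY score)
ORDER  BY score, person;
-- a: pct_rank = 0,   cume = 0.25
-- b: pct_rank = 1/3, cume = 0.75
-- c: pct_rank = 1/3, cume = 0.75
-- d: pct_rank = 1,   cume = 1
```

Both return a fraction between 0 and 1, so multiply by 100 for a percentage.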

distinct on returns a single row for each person_id. The order by clause determines which row that is: here, ordering by created desc keeps the most recent one.

edited Aug 08 '14 at 19:22

answered Aug 08 '14 at 19:17


The `distinct on` trick is awesome. That rank() over thing is going to take more reading/playing to understand. Thanks. – Gargoyle Aug 08 '14 at 21:21
