What you actually asked
Just using the most recently created score for each person
... translates to:
SELECT DISTINCT ON (person_id) *
FROM performances
ORDER BY person_id, created DESC;
Do not add a WHERE
condition here (yet) or you get different (incorrect) results.
Details for DISTINCT ON
:
find out how many people scored worse than I did for a specific category.
... translates to:
SELECT *
, dense_rank() OVER w AS worse_than_me
, ntile(100) OVER w AS percentile
FROM (
SELECT DISTINCT ON (person_id) *
FROM performances
ORDER BY person_id, created DESC
) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);
Assuming "worse" means a lower score.
The window function dense_rank()
is the right tool that answers the question "How many people?" - as opposed to rank()
which answers "How many distinct scores?".
ntile(100)
over the same window definition gives you the ready percentile as integer, 100
meaning in the top 1 %, 99
meaning in the 2nd best % etc.
However, ntile()
returns, per documentation:
integer ranging from 1 to the argument value, dividing the partition
as equally as possible
That means, if you should have less than 100 rows in your partition (like you commented), multiple by 100.0 / count(*)
to scale the number. A "percentile" is not the most useful statistic for just a hand full of rows in a set, it's typically used on big sets.
What you did not ask I
But quite possibly meant to ask:
"How does each person rank in the category (s)he finished last among all other results in that category?"
Assuming unique entries for (person_id, category_id)
, or you also have to define how to deal with multiple results per person in the same category (including self).
SELECT *
FROM (
SELECT DISTINCT ON (person_id) *
FROM performances
ORDER BY person_id, created DESC
) pers
JOIN (
SELECT person_id, category_id
, dense_rank() OVER w AS worse_than_me
, ntile(100) OVER w AS percentile
FROM performances
WINDOW w AS (PARTITION BY category_id ORDER BY score)
) rnk f USING (person_id, category_id);
- In the subquery
pers
we distill the last entry per person (the one of interest).
- In the subquery
rnk
we get ranking and percentile compared to all other entries.
JOIN
with the USING
clause, and you got a ready SELECT
list without duplicate columns.
What you did not ask II
but would also make more sense if there can be multiple entries per (person_id, category_id)
:
"Get the rank for the latest score of each person in each category compared to all other latest personal scores in the same category."
SELECT *
, dense_rank() OVER w AS worse_than_me
, ntile(100) OVER w AS percentile
FROM (
SELECT DISTINCT ON (person_id, category_id) *
FROM performances
ORDER BY person_id, category_id, created DESC;
) p
WINDOW w AS (PARTITION BY category_id ORDER BY score);
Unclear / ambiguous questions lead to arbitrary results. The first step to a solution is to define the task clearly.