31

I'm trying to rank a subset of data within a table but I think I am doing something wrong. I cannot find much information about the rank() feature for postgres, maybe I'm looking in the wrong place. Either way:

I'd like to know the rank of an id that falls within a cluster of a table based on a date. My query is as follows:

select cluster_id,feed_id,pub_date,rank 
from (select feed_id,pub_date,cluster_id,rank() 
    over (order by pub_date asc) from url_info) 
as bar where cluster_id = 9876 and feed_id = 1234;

I'm modeling this after the following stackoverflow post: postgres rank

The reason I think I am doing something wrong is that there are only 39 rows in url_info that are in cluster_id 9876 and this query ran for 10 minutes and never came back. (actually re-ran it for quite a while and it returned no results, yet there is a row in cluster 9876 for id 1234) I'm expecting this will tell me something like "id 1234 was 5th for the criteria given). It will return a relative rank according to my query constraints, correct?

This is postgres 8.4 btw.

Community
  • 1
  • 1
WildBill
  • 9,143
  • 15
  • 63
  • 87
  • 3
    You want the rank relative to what? all records (this is what your query above asks and probably why it takes too long)? the set of records selected by the predicate? some other grouping? The `over` clause should specify a `partition by` if you don't want to rank over all rows. [Here's the tutorial on window functions](http://www.postgresql.org/docs/8.4/static/tutorial-window.html) – dbenhur Apr 22 '12 at 05:23
  • I want to rank relative to the pub_date. Perhaps I can explain a bit better: url_info has thousands of urls in it. 39 of them compse cluster 9876. By selecting only url's that are a member of 9876 (url can ONLY be a member of one cluster), I want to rank the url's (that compose cluster 9876) in order based on pub_date. Do I still use partitions based on window functions for this? I looked at the URL you sent and it looks like this is referring to ranking items after I compute some value, which I am not doing here.... – WildBill Apr 22 '12 at 05:34

2 Answers2

44

By placing the rank() function in the subselect and not specifying a PARTITION BY in the over clause or any predicate in that subselect, your query is asking to produce a rank over the entire url_info table ordered by pub_date. This is likely why it ran so long as to rank over all of url_info, Pg must sort the entire table by pub_date, which will take a while if the table is very large.

It appears you want to generate a rank for just the set of records selected by the where clause, in which case, all you need do is eliminate the subselect and the rank function is implicitly over the set of records matching that predicate.

select 
  cluster_id
 ,feed_id
 ,pub_date
 ,rank() over (order by pub_date asc) as rank
from url_info
where cluster_id = 9876 and feed_id = 1234;

If what you really wanted was the rank within the cluster, regardless of the feed_id, you can rank in a subselect which filters to that cluster:

select ranked.*
from (
  select 
    cluster_id
   ,feed_id
   ,pub_date
   ,rank() over (order by pub_date asc) as rank
  from url_info
  where cluster_id = 9876
) as ranked
where feed_id = 1234;
dbenhur
  • 20,008
  • 4
  • 48
  • 45
  • 1
    `rank() over (order by pub_date asc) as rank` is redundant, as the default column name is the function's name – isapir Dec 28 '17 at 00:00
  • 2
    @isapir Maybe so, but that's not a strong promise; from the [docs](https://www.postgresql.org/docs/current/static/sql-select.html#SQL-SELECT-LIST): "In more complex cases a function or type name **may be** used, or the system may fall back on a generated name such as ?column?" – dbenhur Jan 03 '18 at 22:43
8

Sharing another example of DENSE_RANK() of PostgreSQL. Find top 3 students sample query. Reference taken from this blog:

Create a table with sample data:

CREATE TABLE tbl_Students
(
    StudID INT
    ,StudName CHARACTER VARYING
    ,TotalMark INT
);

INSERT INTO tbl_Students 
VALUES 
(1,'Anvesh',88),(2,'Neevan',78)
,(3,'Roy',90),(4,'Mahi',88)
,(5,'Maria',81),(6,'Jenny',90);

Using DENSE_RANK(), Calculate RANK of students:

;WITH cteStud AS
(
    SELECT 
        StudName
        ,Totalmark
        ,DENSE_RANK() OVER (ORDER BY TotalMark DESC) AS StudRank
    FROM tbl_Students
)
SELECT 
    StudName
    ,Totalmark
    ,StudRank
FROM cteStud 
WHERE StudRank <= 3;

The Result:

studname | totalmark | studrank
----------+-----------+----------
 Roy      |        90 |        1
 Jenny    |        90 |        1
 Anvesh   |        88 |        2
 Mahi     |        88 |        2
 Maria    |        81 |        3
(5 rows)
Anvesh
  • 7,103
  • 3
  • 45
  • 43
  • 4
    I think it is worth pointing out that *dense_rank()* produces ranks without gaps (1,1,2,3,4,5,5,5,6,7,...), whereas *rank()* produces ranks with gaps (1,1,3,4,5,5,7,...) – lionbigcat Jun 25 '20 at 14:37