Select n largest count by category in Redshift

Question

I would like to select the X most common pairs per group in a table. Let's consider the following table:

+-------------+-----------+
| identifier  |    city   |
+-------------+-----------+
| AB          |  Seattle  |
| AC          |  Seattle  |
| AC          |  Seattle  |
| AB          |  Seattle  |
| AD          |  Seattle  |
| AB          |  Chicago  |
| AB          |  Chicago  |
| AD          |  Chicago  |
| AD          |  Chicago  |
| BC          |  Chicago  |
+-------------+-----------+

Seattle, AB occurs 2x
Seattle, AC occurs 2x
Seattle, AD occurs 1x
Chicago, AB occurs 2x
Chicago, AD occurs 2x
Chicago, BC occurs 1x

If I would like to select the 2 most commons per city, the result should be:

+-------------+-----------+
| identifier  |    city   |
+-------------+-----------+
| AB          |  Seattle  |
| AC          |  Seattle  |
| AB          |  Chicago  |
| AD          |  Chicago  |
+-------------+-----------+

Any help is appreciated. Thanks, Benni

Possible duplicate of [Get top n records for each group of grouped results](https://stackoverflow.com/questions/12113699/get-top-n-records-for-each-group-of-grouped-results) — mato, Jul 28 '17 at 00:47

Vamsi Prabhala · Accepted Answer · 2017-07-28T00:48:10.700

7

You can use count in row number to order the number of appearances per city combination and choose the first two.

select city,identifier 
from (
select city,identifier
,row_number() over(partition by city order by count(*) desc,identifier) as rnum_cnt
from tbl
group by city,identifier
) t
where rnum_cnt<=2

edited Jul 28 '17 at 00:48

answered Jul 28 '17 at 00:41

Vamsi Prabhala

48,685
4
36
58

you can't use `count(*)` inside the partition, at least in Redshift, count should be done in a subquery – AlexYes Jul 28 '17 at 08:36
@AlexYes It seems you can. The query in the answer gave me correct result. Additionally, [documentation](http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html) says that an expression is allowed in the order list. – Dmitrii I. Jul 28 '17 at 09:47
@DmitriiI. interesting! I was thinking only about scalar expressions when I saw the docs, I didn't know Redshift is that smart:) thanks! – AlexYes Jul 28 '17 at 10:05

score 0 · Answer 2 · answered Jul 28 '17 at 10:03

Using the WITH clause:

with
    _counts as (
        select
            identifier,
            city,
            count(*) as city_id_count
        from
            t1
        group by
            identifier,
            city
    ),

    _counts_and_max as (
        select
            identifier,
            city,
            city_id_count,
            max(city_id_count) over (partition by city) as city_max_count
        from
            _counts
    )

    select
        identifier,
        city
    from
        _counts_and_max
    where
        city_id_count = city_max_count
    ;

Select n largest count by category in Redshift

2 Answers2