2

I would like to select the X most common pairs per group in a table. Let's consider the following table:

+-------------+-----------+
| identifier  |    city   |
+-------------+-----------+
| AB          |  Seattle  |
| AC          |  Seattle  |
| AC          |  Seattle  |
| AB          |  Seattle  |
| AD          |  Seattle  |
| AB          |  Chicago  |
| AB          |  Chicago  |
| AD          |  Chicago  |
| AD          |  Chicago  |
| BC          |  Chicago  |
+-------------+-----------+
  • Seattle, AB occurs 2x
  • Seattle, AC occurs 2x
  • Seattle, AD occurs 1x
  • Chicago, AB occurs 2x
  • Chicago, AD occurs 2x
  • Chicago, BC occurs 1x

If I would like to select the 2 most commons per city, the result should be:

+-------------+-----------+
| identifier  |    city   |
+-------------+-----------+
| AB          |  Seattle  |
| AC          |  Seattle  |
| AB          |  Chicago  |
| AD          |  Chicago  |
+-------------+-----------+

Any help is appreciated. Thanks, Benni

Benni
  • 357
  • 6
  • 18
  • Possible duplicate of [Get top n records for each group of grouped results](https://stackoverflow.com/questions/12113699/get-top-n-records-for-each-group-of-grouped-results) – mato Jul 28 '17 at 00:47

2 Answers2

7

You can use count in row number to order the number of appearances per city combination and choose the first two.

select city,identifier 
from (
select city,identifier
,row_number() over(partition by city order by count(*) desc,identifier) as rnum_cnt
from tbl
group by city,identifier
) t
where rnum_cnt<=2
Vamsi Prabhala
  • 48,685
  • 4
  • 36
  • 58
  • you can't use `count(*)` inside the partition, at least in Redshift, count should be done in a subquery – AlexYes Jul 28 '17 at 08:36
  • @AlexYes It seems you can. The query in the answer gave me correct result. Additionally, [documentation](http://docs.aws.amazon.com/redshift/latest/dg/r_Window_function_synopsis.html) says that an expression is allowed in the order list. – Dmitrii I. Jul 28 '17 at 09:47
  • @DmitriiI. interesting! I was thinking only about scalar expressions when I saw the docs, I didn't know Redshift is that smart:) thanks! – AlexYes Jul 28 '17 at 10:05
0

Using the WITH clause:

with
    _counts as (
        select
            identifier,
            city,
            count(*) as city_id_count
        from
            t1
        group by
            identifier,
            city
    ),

    _counts_and_max as (
        select
            identifier,
            city,
            city_id_count,
            max(city_id_count) over (partition by city) as city_max_count
        from
            _counts
    )

    select
        identifier,
        city
    from
        _counts_and_max
    where
        city_id_count = city_max_count
    ;
Dmitrii I.
  • 696
  • 7
  • 16