262

I need to take the first N rows for each group, ordered by custom column.

Given the following table:

db=# SELECT * FROM xxx;
 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  3 |          1 | C
  4 |          1 | D
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
  8 |          2 | H
(8 rows)

I need the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:

 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
(5 rows)

I am using PostgreSQL 8.3.5.

Kouber Saparev
  • 7,637
  • 2
  • 29
  • 26

6 Answers6

385

New solution (PostgreSQL 8.4)

SELECT
  * 
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
    t.*
  FROM
    xxx t) x
WHERE
  x.r <= 2;
ngspkinga
  • 421
  • 5
  • 16
Dave
  • 3,866
  • 1
  • 15
  • 2
  • 11
    This works with PostgreSQL 8.4 too (window functions start with 8.4). – Bruno Mar 02 '12 at 15:48
  • 6
    Awesome! It works flawlessly. I am curious though, is there a way to do this with `group by`? – NurShomik Oct 19 '16 at 18:06
  • 8
    For those who works with like millions rows and seeks for really performant way to do this - poshest's answer is the way to go. Just dont forget to spice ti up with proper indexing. – Diligent Key Presser Jun 20 '19 at 08:46
  • This works in mySQL 8.0.24, fast condensation of 6M rows of associated email addresses to five in each category (no sorting or criteria, just needed five email addresses. A small number of categories (companies) had several thousand addresses.) – wistlo May 12 '21 at 19:08
93

Since v9.3 you can do a lateral join

select distinct t_outer.section_id, t_top.id, t_top.name from t t_outer
join lateral (
    select * from t t_inner
    where t_inner.section_id = t_outer.section_id
    order by t_inner.name
    limit 2
) t_top on true
order by t_outer.section_id;

It might be faster but, of course, you should test performance specifically on your data and use case.

poshest
  • 4,157
  • 2
  • 26
  • 37
  • 10
    Very cryptic solution IMO, specially with those names, but a good one. – villasv May 08 '17 at 21:18
  • 3
    This solution with LATERAL JOIN **might be** significantly faster than above one with windowed function (in some cases) if you have index by `t_inner.name` column – Artur Rashitov Aug 07 '17 at 15:42
  • The query is easier to understand if it does not contain the self-join. In that case `distinct` is not needed. An example is shown in the link poshest posted. – gillesB Dec 04 '18 at 10:41
  • 1
    Dude, this is mindlowing. 120ms instead of 9sec yielded with "ROW_NUMBER" solution. Thank you! – Diligent Key Presser Jun 20 '19 at 08:50
  • How can we select all columns of t_top. The t table contains a json column and I get "could not identify equality operator for type json postgres" error when I select `distinct t_outer.section_id, t_top.*` – suat May 10 '20 at 11:35
  • @suat use jsonb, which has an equality operator from v9.4 onwards. Or maybe [this approach](https://stackoverflow.com/a/31877016/2036135) will work? I'm not sure whether group by will work with this lateral join formula. Do let us know! – poshest May 12 '20 at 08:16
  • Running a variant of this lateral join on a table with 2 million rows put my DB's CPU at 100% for 12 hours. YMMV. – Max Rosett Aug 19 '21 at 15:36
  • this solution got `SQL Error [53000]: ERROR: insufficient memory reserved for statement`, but `ROW_NUMBER` solution worked – yurenchen Mar 06 '22 at 23:20
  • Instead of `SELECT DISTINCT`, make use of the join condition to remove the duplicates created: `t_top ON _outer.id = t_top.id`. This adds a "subquery scan" but removes a sort and unique operation. – Bluenix Jul 06 '23 at 14:02
12

Here's another solution (PostgreSQL <= 8.3).

SELECT
  *
FROM
  xxx a
WHERE (
  SELECT
    COUNT(*)
  FROM
    xxx
  WHERE
    section_id = a.section_id
  AND
    name <= a.name
) <= 2
Kouber Saparev
  • 7,637
  • 2
  • 29
  • 26
12

A lateral join is the way to go, but you should do a nested query first to improve performance on large tables.

SELECT t_limited.*
FROM (
        SELECT DISTINCT section_id
        FROM t
    ) t_groups
    JOIN LATERAL (
        SELECT *
        FROM t t_all
        WHERE t_all.section_id = t_groups.section_id
        ORDER BY t_all.name
        LIMIT 2
    ) t_limited ON true

Without the nested select distinct, the join lateral runs for every line in the table, even though the section_id is often duplicated. With the nested select distinct, the join lateral runs once and only once for each distinct section_id.

David Skinner
  • 121
  • 1
  • 3
  • 1
    I tried this out, and it is much faster than other solutions. I'm discarding my previous solution based on Window Functions in favour to this one – marcopeg Dec 21 '22 at 07:27
2
SELECT  x.*
FROM    (
        SELECT  section_id,
                COALESCE
                (
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY
                        name, id
                OFFSET 1 LIMIT 1
                ),
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY 
                        name DESC, id DESC
                LIMIT 1
                )
                ) AS mlast
        FROM    (
                SELECT  DISTINCT section_id
                FROM    xxx
                ) xo
        ) xoo
JOIN    xxx x
ON      x.section_id = xoo.section_id
        AND (x.name, x.id) <= ((mlast).name, (mlast).id)
Quassnoi
  • 413,100
  • 91
  • 616
  • 614
  • The query is very close to the one I need, except that it is not showing sections with less than 2 rows, i.e. the row with ID=7 isn't returned. Otherwise I like your approach. – Kouber Saparev Jul 14 '09 at 15:29
  • Thank you, I just came to the same solution with COALESCE, but you were faster. :-) – Kouber Saparev Jul 14 '09 at 15:41
  • Actually the last JOIN sub-clause could be simplified to: ... AND x.id <= (mlast).id as the ID have already been chosen according to the name field, no? – Kouber Saparev Jul 14 '09 at 15:47
  • @Kouber: in your example the `name`'s and `id`'s are sorted in same order, so you won't see it. Make the names in reverse order and you will see that these queries yield different results. – Quassnoi Jul 14 '09 at 16:12
2
        -- ranking without WINDOW functions
-- EXPLAIN ANALYZE
WITH rnk AS (
        SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        )
SELECT this.*
FROM xxx this
JOIN rnk ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;

        -- The same without using a CTE
-- EXPLAIN ANALYZE
SELECT this.*
FROM xxx this
JOIN ( SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        ) rnk
ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;
wildplasser
  • 43,142
  • 8
  • 66
  • 109
  • CTEs and Window functions were introduced with the same version, so I don't see the benefit of the first solution. –  Dec 07 '12 at 20:56
  • 1
    The post is three years old. Besides, there may still be implementations that lack them (nudge nudge say no more). It could also be considered an exercise in old-fashoned querybuilding. (though CTEs are not very old-fashoned) – wildplasser Dec 07 '12 at 21:00
  • The post is tagged "postgresql" and the PostgreSQL version that introduced CTEs also introduced windowing functions. Hence my comment (I did see it's that old - and PG 8.3 did have neither) –  Dec 07 '12 at 21:01
  • The post mentions 8.3.5, and I believe they were introduced in 8.4. Besides: it is also good to know about alternative scenarios, IMHO. – wildplasser Dec 07 '12 at 21:03
  • That's exactly what I mean: 8.3 neither had CTEs nor windowing functions. So the first solution won't work on 8.3 –  Dec 07 '12 at 21:05
  • I know, but the post is also tagged 'SQL' which implies "any mix of features", so this just adds other ways to perform the same task. (I confess I often use CTEs for constructing queries, mostly because the syntax is clearer to the human eye) – wildplasser Dec 07 '12 at 21:09