Grouped LIMIT in PostgreSQL: show the first N rows for each group?

Question

I need to take the first N rows for each group, ordered by custom column.

Given the following table:

db=# SELECT * FROM xxx;
 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  3 |          1 | C
  4 |          1 | D
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
  8 |          2 | H
(8 rows)

I need the first 2 rows (ordered by name) for each section_id, i.e. a result similar to:

 id | section_id | name
----+------------+------
  1 |          1 | A
  2 |          1 | B
  5 |          2 | E
  6 |          2 | F
  7 |          3 | G
(5 rows)

I am using PostgreSQL 8.3.5.

score 385 · Accepted Answer · edited Aug 11 '16 at 09:08

385

New solution (PostgreSQL 8.4)

SELECT
  * 
FROM (
  SELECT
    ROW_NUMBER() OVER (PARTITION BY section_id ORDER BY name) AS r,
    t.*
  FROM
    xxx t) x
WHERE
  x.r <= 2;

edited Aug 11 '16 at 09:08

ngspkinga

421
5
16

answered May 19 '11 at 19:55

Dave

3,866
1
15
2

11

This works with PostgreSQL 8.4 too (window functions start with 8.4). – Bruno Mar 02 '12 at 15:48
6

Awesome! It works flawlessly. I am curious though, is there a way to do this with `group by`? – NurShomik Oct 19 '16 at 18:06
8

For those who works with like millions rows and seeks for really performant way to do this - poshest's answer is the way to go. Just dont forget to spice ti up with proper indexing. – Diligent Key Presser Jun 20 '19 at 08:46
This works in mySQL 8.0.24, fast condensation of 6M rows of associated email addresses to five in each category (no sorting or criteria, just needed five email addresses. A small number of categories (companies) had several thousand addresses.) – wistlo May 12 '21 at 19:08

score 93 · Answer 2 · answered Jun 16 '16 at 14:25

93

Since v9.3 you can do a lateral join

select distinct t_outer.section_id, t_top.id, t_top.name from t t_outer
join lateral (
    select * from t t_inner
    where t_inner.section_id = t_outer.section_id
    order by t_inner.name
    limit 2
) t_top on true
order by t_outer.section_id;

It might be faster but, of course, you should test performance specifically on your data and use case.

answered Jun 16 '16 at 14:25

poshest

4,157
2
26
37

10

Very cryptic solution IMO, specially with those names, but a good one. – villasv May 08 '17 at 21:18
3

This solution with LATERAL JOIN **might be** significantly faster than above one with windowed function (in some cases) if you have index by `t_inner.name` column – Artur Rashitov Aug 07 '17 at 15:42
The query is easier to understand if it does not contain the self-join. In that case `distinct` is not needed. An example is shown in the link poshest posted. – gillesB Dec 04 '18 at 10:41
1

Dude, this is mindlowing. 120ms instead of 9sec yielded with "ROW_NUMBER" solution. Thank you! – Diligent Key Presser Jun 20 '19 at 08:50
How can we select all columns of t_top. The t table contains a json column and I get "could not identify equality operator for type json postgres" error when I select `distinct t_outer.section_id, t_top.*` – suat May 10 '20 at 11:35
@suat use jsonb, which has an equality operator from v9.4 onwards. Or maybe [this approach](https://stackoverflow.com/a/31877016/2036135) will work? I'm not sure whether group by will work with this lateral join formula. Do let us know! – poshest May 12 '20 at 08:16
Running a variant of this lateral join on a table with 2 million rows put my DB's CPU at 100% for 12 hours. YMMV. – Max Rosett Aug 19 '21 at 15:36
this solution got `SQL Error [53000]: ERROR: insufficient memory reserved for statement`, but `ROW_NUMBER` solution worked – yurenchen Mar 06 '22 at 23:20
Instead of `SELECT DISTINCT`, make use of the join condition to remove the duplicates created: `t_top ON _outer.id = t_top.id`. This adds a "subquery scan" but removes a sort and unique operation. – Bluenix Jul 06 '23 at 14:02

Kouber Saparev · Answer 3 · 2016-02-15T11:26:01.153

12

Here's another solution (PostgreSQL <= 8.3).

SELECT
  *
FROM
  xxx a
WHERE (
  SELECT
    COUNT(*)
  FROM
    xxx
  WHERE
    section_id = a.section_id
  AND
    name <= a.name
) <= 2

edited Feb 15 '16 at 11:26

answered Jul 17 '09 at 14:41

Kouber Saparev

7,637
2
29
26

score 12 · Answer 4 · answered Mar 24 '22 at 11:36

A lateral join is the way to go, but you should do a nested query first to improve performance on large tables.

SELECT t_limited.*
FROM (
        SELECT DISTINCT section_id
        FROM t
    ) t_groups
    JOIN LATERAL (
        SELECT *
        FROM t t_all
        WHERE t_all.section_id = t_groups.section_id
        ORDER BY t_all.name
        LIMIT 2
    ) t_limited ON true

Without the nested select distinct, the join lateral runs for every line in the table, even though the section_id is often duplicated. With the nested select distinct, the join lateral runs once and only once for each distinct section_id.

I tried this out, and it is much faster than other solutions. I'm discarding my previous solution based on Window Functions in favour to this one — marcopeg, Dec 21 '22 at 07:27

Quassnoi · Answer 5 · 2009-07-14T15:35:15.777

2

SELECT  x.*
FROM    (
        SELECT  section_id,
                COALESCE
                (
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY
                        name, id
                OFFSET 1 LIMIT 1
                ),
                (
                SELECT  xi
                FROM    xxx xi
                WHERE   xi.section_id = xo.section_id
                ORDER BY 
                        name DESC, id DESC
                LIMIT 1
                )
                ) AS mlast
        FROM    (
                SELECT  DISTINCT section_id
                FROM    xxx
                ) xo
        ) xoo
JOIN    xxx x
ON      x.section_id = xoo.section_id
        AND (x.name, x.id) <= ((mlast).name, (mlast).id)

edited Jul 14 '09 at 15:35

answered Jul 14 '09 at 10:48

Quassnoi

413,100
91
616
614

The query is very close to the one I need, except that it is not showing sections with less than 2 rows, i.e. the row with ID=7 isn't returned. Otherwise I like your approach. – Kouber Saparev Jul 14 '09 at 15:29
Thank you, I just came to the same solution with COALESCE, but you were faster. :-) – Kouber Saparev Jul 14 '09 at 15:41
Actually the last JOIN sub-clause could be simplified to: ... AND x.id <= (mlast).id as the ID have already been chosen according to the name field, no? – Kouber Saparev Jul 14 '09 at 15:47
@Kouber: in your example the `name`'s and `id`'s are sorted in same order, so you won't see it. Make the names in reverse order and you will see that these queries yield different results. – Quassnoi Jul 14 '09 at 16:12

score 2 · Answer 6 · answered Dec 07 '12 at 20:53

2

        -- ranking without WINDOW functions
-- EXPLAIN ANALYZE
WITH rnk AS (
        SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        )
SELECT this.*
FROM xxx this
JOIN rnk ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;

        -- The same without using a CTE
-- EXPLAIN ANALYZE
SELECT this.*
FROM xxx this
JOIN ( SELECT x1.id
        , COUNT(x2.id) AS rnk
        FROM xxx x1
        LEFT JOIN xxx x2 ON x1.section_id = x2.section_id AND x2.name <= x1.name
        GROUP BY x1.id
        ) rnk
ON rnk.id = this.id
WHERE rnk.rnk <=2
ORDER BY this.section_id, rnk.rnk
        ;

answered Dec 07 '12 at 20:53

wildplasser

43,142
8
66
109

CTEs and Window functions were introduced with the same version, so I don't see the benefit of the first solution. – Dec 07 '12 at 20:56
1

The post is three years old. Besides, there may still be implementations that lack them (nudge nudge say no more). It could also be considered an exercise in old-fashoned querybuilding. (though CTEs are not very old-fashoned) – wildplasser Dec 07 '12 at 21:00
The post is tagged "postgresql" and the PostgreSQL version that introduced CTEs also introduced windowing functions. Hence my comment (I did see it's that old - and PG 8.3 did have neither) – Dec 07 '12 at 21:01
The post mentions 8.3.5, and I believe they were introduced in 8.4. Besides: it is also good to know about alternative scenarios, IMHO. – wildplasser Dec 07 '12 at 21:03
That's exactly what I mean: 8.3 neither had CTEs nor windowing functions. So the first solution won't work on 8.3 – Dec 07 '12 at 21:05
I know, but the post is also tagged 'SQL' which implies "any mix of features", so this just adds other ways to perform the same task. (I confess I often use CTEs for constructing queries, mostly because the syntax is clearer to the human eye) – wildplasser Dec 07 '12 at 21:09

Grouped LIMIT in PostgreSQL: show the first N rows for each group?

6 Answers6

Linked

Related