How to sort and paginate groups of rows based on first row in the group?

Question

I have items with a timestamp and a foreign key id. I want to group them by the foreign key, sort each group by timestamp taking the first 3 from each group, and also sort all groups by the timestamp of the first item, like this:

+----+-------+-------+-------+
| id | item1 | item2 | item3 |
+----+-------+-------+-------+
| A  |     1 |    13 |    99 |
| B  |    10 |    20 |    21 |
| C  |    50 |    51 |    60 |
| D  |    56 |    70 |    75 |
+----+-------+-------+-------+

I would also like to be able to select ranges based on the ts of the first item (so querying for ts > 5 AND ts < 55 would exclude A and D -- note that C contains a row with ts=60 but I still want to include that because the first element in that group has ts=50)

My approach right now is to find the ids of the first item in each set in a subquery, then select the topN for those ids, which doesn't seem ideal; we end up redoing the same sorts twice.

SELECT *
FROM   (SELECT Row_number()
                 OVER (
                   partition BY things.id
                   ORDER BY links.created_at) AS r2,
               links.*
        FROM   things
               INNER JOIN links
                       ON ( links.b_id = things.id )
        WHERE  b_id IN (SELECT thing_id
                               FROM
               (SELECT Row_number()
                         OVER (
                           partition BY links.b_id
                           ORDER BY links.created_at) AS
                       r,
                       b_id                           AS
                       thing_id,
                       created_at
                FROM   links
                WHERE  links.entity_b_type = 'thing'
                       AND links.user_id =
                           '1234') tmp
                               WHERE  r = 1
                                      AND created_at < some_time)) tmp
WHERE  r2 <= 5;

Can I somehow sort the original results (with r <= 3) without the second select pass?

With sample data it's a little difficult to tell what's really going on here but I suspect you might want to either use a `WIth` clause or use more than one `Row_Number` in the select clause — Conrad Frix, Jul 24 '15 at 18:00
@ConradFrix: sorry, the sample data is kind of inverse of the actual logic below (lower values sorted first, whereas in reality higher timestamps are sorted first). Basically for every row I want to get a tuple `(n, r)` such that n is the rank of the group and r is rank within the group, but I *also* want to be able to paginate arbitrarily on the latest timestamp of the group — Arkadiy Kukarkin, Jul 24 '15 at 18:04
I dont' think this answers your question But somthing [like this](http://sqlfiddle.com/#!15/68159/10) might be what you're looking for — Conrad Frix, Jul 24 '15 at 20:28
A question like this should provide a table definition (`\d tbl` in psql) or valid `CREATE TABLE` script, some sample data (`INSERT` script) and the Postgres version. Usefulness of the desired output is limited without also seeing the sample it is based on. Questions should be useful to the general public. You have a much better chance for a good answer this way. — Erwin Brandstetter, Jul 25 '15 at 00:44

score 1 · Answer 1 · edited May 23 '17 at 12:30

Assuming referential integrity between things and links, the query you display can be simplified to:

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY b_id ORDER BY created_at) AS rn
   FROM   links l
   WHERE  EXISTS (
      SELECT 1
      FROM   links l1
      WHERE  l1.b_id = l.bid
      AND    l1.entity_b_type = 'thing'
      AND    l1.user_id = '1234'  -- why quoted? not integer?
      AND    l1.created_at < some_time
      )
   ) l
JOIN   things t ON t.id = l.b_id 
WHERE  l.rn <= 5;

Depending on data distribution, chances are good that a LATERAL solution is even faster:

SELECT *
FROM   things t 
     , LATERAL (
   SELECT *, row_number() OVER (ORDER BY created_at) AS rn  -- optional info
   FROM   links l
   WHERE  l.b_id = t.id
   ORDER  BY created_at
   LIMIT  5
   ) l
WHERE  EXISTS (
   SELECT 1
   FROM   links l
   WHERE  l.b_id = t.id
   AND    l.entity_b_type = 'thing'
   AND    l.user_id = '1234'
   AND    l.created_at < some_time
   );

Detailed explanation (chapter "2a. LATERAL join"):

Optimize GROUP BY query to retrieve latest record per user

Key to performance are matching indexes. Indexing always depends on the complete picture, but these would make the query very fast:

CREATE INDEX links_idx1 ON links (user_id, entity_b_type, created_at);
CREATE INDEX links_idx2 ON links (b_id, created_at);

It is suspicious that you first check whether the first links.created_at for the given predicates entity_b_type = 'thing' AND user_id = '1234' is old enough, but then go on using the oldest rows per b_id irregardless of those predicates. If that's a mistake, the query might be simplified further.

Untested. It's hard to say more without basic information.

How to sort and paginate groups of rows based on first row in the group?

1 Answers1