0

I have items with a timestamp and a foreign key id. I want to group them by the foreign key, sort each group by timestamp taking the first 3 from each group, and also sort all groups by the timestamp of the first item, like this:

+----+-------+-------+-------+
| id | item1 | item2 | item3 |
+----+-------+-------+-------+
| A  |     1 |    13 |    99 |
| B  |    10 |    20 |    21 |
| C  |    50 |    51 |    60 |
| D  |    56 |    70 |    75 |
+----+-------+-------+-------+

I would also like to be able to select ranges based on the ts of the first item (so querying for ts > 5 AND ts < 55 would exclude A and D -- note that C contains a row with ts=60 but I still want to include that because the first element in that group has ts=50)

My approach right now is to find the ids of the first item in each set in a subquery, then select the topN for those ids, which doesn't seem ideal; we end up redoing the same sorts twice.

SELECT *
FROM   (SELECT Row_number()
                 OVER (
                   partition BY things.id
                   ORDER BY links.created_at) AS r2,
               links.*
        FROM   things
               INNER JOIN links
                       ON ( links.b_id = things.id )
        WHERE  b_id IN (SELECT thing_id
                               FROM
               (SELECT Row_number()
                         OVER (
                           partition BY links.b_id
                           ORDER BY links.created_at) AS
                       r,
                       b_id                           AS
                       thing_id,
                       created_at
                FROM   links
                WHERE  links.entity_b_type = 'thing'
                       AND links.user_id =
                           '1234') tmp
                               WHERE  r = 1
                                      AND created_at < some_time)) tmp
WHERE  r2 <= 5;

Can I somehow sort the original results (with r <= 3) without the second select pass?

Arkadiy Kukarkin
  • 2,153
  • 14
  • 24
  • With sample data it's a little difficult to tell what's really going on here but I suspect you might want to either use a `WIth` clause or use more than one `Row_Number` in the select clause – Conrad Frix Jul 24 '15 at 18:00
  • @ConradFrix: sorry, the sample data is kind of inverse of the actual logic below (lower values sorted first, whereas in reality higher timestamps are sorted first). Basically for every row I want to get a tuple `(n, r)` such that n is the rank of the group and r is rank within the group, but I *also* want to be able to paginate arbitrarily on the latest timestamp of the group – Arkadiy Kukarkin Jul 24 '15 at 18:04
  • I dont' think this answers your question But somthing [like this](http://sqlfiddle.com/#!15/68159/10) might be what you're looking for – Conrad Frix Jul 24 '15 at 20:28
  • A question like this should provide a table definition (`\d tbl` in psql) or valid `CREATE TABLE` script, some sample data (`INSERT` script) and the Postgres version. Usefulness of the desired output is limited without also seeing the sample it is based on. Questions should be useful to the general public. You have a much better chance for a good answer this way. – Erwin Brandstetter Jul 25 '15 at 00:44

1 Answers1

1

Assuming referential integrity between things and links, the query you display can be simplified to:

SELECT *
FROM  (
   SELECT *, row_number() OVER (PARTITION BY b_id ORDER BY created_at) AS rn
   FROM   links l
   WHERE  EXISTS (
      SELECT 1
      FROM   links l1
      WHERE  l1.b_id = l.bid
      AND    l1.entity_b_type = 'thing'
      AND    l1.user_id = '1234'  -- why quoted? not integer?
      AND    l1.created_at < some_time
      )
   ) l
JOIN   things t ON t.id = l.b_id 
WHERE  l.rn <= 5;

Depending on data distribution, chances are good that a LATERAL solution is even faster:

SELECT *
FROM   things t 
     , LATERAL (
   SELECT *, row_number() OVER (ORDER BY created_at) AS rn  -- optional info
   FROM   links l
   WHERE  l.b_id = t.id
   ORDER  BY created_at
   LIMIT  5
   ) l
WHERE  EXISTS (
   SELECT 1
   FROM   links l
   WHERE  l.b_id = t.id
   AND    l.entity_b_type = 'thing'
   AND    l.user_id = '1234'
   AND    l.created_at < some_time
   );

Detailed explanation (chapter "2a. LATERAL join"):

Key to performance are matching indexes. Indexing always depends on the complete picture, but these would make the query very fast:

CREATE INDEX links_idx1 ON links (user_id, entity_b_type, created_at);
CREATE INDEX links_idx2 ON links (b_id, created_at);

It is suspicious that you first check whether the first links.created_at for the given predicates entity_b_type = 'thing' AND user_id = '1234' is old enough, but then go on using the oldest rows per b_id irregardless of those predicates. If that's a mistake, the query might be simplified further.

Untested. It's hard to say more without basic information.

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228