Efficient way to get entire rows with MAX(id) when grouping by a foreign key

Question

Consider tables A, B and C. B and C are related to A through a foreign key, and there are many Bs and Cs with the same A foreign key.

Suppose the following query:

SELECT
  A.pk AS pk_a,
  MAX(B.id) AS new_b,
  MAX(C.id) AS new_c
FROM A
INNER JOIN B ON B.fk_a = pk_a
INNER JOIN C ON C.fk_a = pk_a
GROUP BY pk_a

I would like to retrieve the entire new_b and new_c rows from B and C for each GROUP BY pk_a.

Surely I could wrap this as a subselect and JOIN B ON b.id = new_b, and the same for C, but B and C are huge and I would like to avoid this.

I could also use SELECT DISTINCT ON(A.pk) A.pk, B.*, C.* and ORDER BY A.pk, B.id, C.id, but that would only guarantee the latest B., not the latest C..

Is there any other way I'm missing?

Since `B and C are huge`, performance is an issue. To optimize, we need more information: How many rows in A, B and C? How many rows from A are referenced in B and C respectively? I suspect *many* related rows in B and C for one row in A? — Erwin Brandstetter, Jan 27 '17 at 15:47
BTW: column names in the example is inconsistent: `A.pk` <> `A.pk_a`. (The join clause cannot reference the column alias `pk_a`, only the source column name `pk`. Two answers copied this mistake, which makes them equally incorrect. — Erwin Brandstetter, Jan 27 '17 at 16:22

score 3 · Accepted Answer · edited May 23 '17 at 11:53

For few rows (like 2 or 3 or 5 on avg., depends) in B and C per row in A, DISTINCT ON is typically fastest.

For many rows per row in A, there are (much) more efficient solutions. And your information: "B and C are huge" indicates as much.
I suggest LATERAL subqueries with ORDER BY and LIMIT 1, backed by a matching index.

SELECT A.pk AS pk_a, B.*, C.*
FROM   A
LEFT   JOIN LATERAL (
   SELECT *
   FROM   B
   WHERE  B.fk_a = A.pk  -- lateral reference
   ORDER  BY B.id DESC
   LIMIT  1
   ) B ON true
LEFT JOIN LATERAL (
   SELECT *
   FROM   C
   WHERE  C.fk_a = A.pk  -- lateral reference
   ORDER  BY C.id DESC
   LIMIT  1
   ) C ON true;

Assuming B.id and C.id are NOT NULL.

You need at least indexes on the FK columns. Ideally, multi-column indexes on B (fk_a, id DESC) and C (fk_a, id DESC) though.

Use LEFT JOIN! to not exclude rows from A that are not referenced in either B or C. It would be an evil trap to use [INNER] JOIN here, since you join to two unrelated tables.

Detailed explanation:

Optimize GROUP BY query to retrieve latest record per user

The result of above query has pk_a once and fk_a twice. Useless ballast - and the same column name twice may be an actual problem, depending on your client.

You can spell out a column list in the outer SELECT (instead of the syntax shortcut A.*, B.*) to avoid redundancies. You may have to do that either way if there are more duplicate names or if you don't want all columns.

But with a smart naming convention, the USING clause can fold the redundant PK and FK columns for you:

SELECT *
FROM   A
LEFT   JOIN LATERAL (
   SELECT * FROM B
   WHERE  B.a_id = A.a_id
   ORDER  BY B.id DESC
   LIMIT  1
   ) B USING (a_id)
LEFT   JOIN LATERAL (
   SELECT * FROM C
   WHERE  C.a_id = A.a_id
   ORDER  BY C.id DESC
   LIMIT  1
   ) C USING (a_id);

Logically, USING (a_id) is redundant here, since WHERE B.a_id = A.a_id in the subquery already filters the same way. But the additional effect of USING is that joining columns are folded to one instance. So only one a_id remains in the result. The manual:

Furthermore, the output of JOIN USING suppresses redundant columns: there is no need to print both of the matched columns, since they must have equal values. While JOIN ON produces all columns from T1 followed by all columns from T2, JOIN USING produces one output column for each of the listed column pairs (in the listed order), followed by any remaining columns from T1, followed by any remaining columns from T2.

It also typically makes a lot of sense to use the same name for the same data. So: a_id for PK and FK columns.

The simple(r) `(fk_a, id)` indexes would be sufficient, because of backwards index scans, aren't they? Though the `(fk_a, id DESC)` indexes are good enough, but why complicate their directions? — pozs, Jan 27 '17 at 16:36
@pozs: Yes, reverse index scan is *almost* as fast in Postgres - generally. So a plain index is typically good enough for both directions. But for *certain* applications of ***multicolumn*** indexes, the order per column does matter (including this one). See: http://dba.stackexchange.com/a/39599/3684 And since adding `DESC` (or even `DESC NULLS LAST`) means no complications other than adding the key words, I would highly recommend it if that's the typical order you need. — Erwin Brandstetter, Jan 27 '17 at 16:44

Gordon Linoff · Answer 2 · 2017-01-28T03:57:30.527

Is this what you are asking for?

SELECT abc.*
FROM (SELECT A.pk AS pk_a, b.*, c.*,
             ROW_NUMBER() OVER (PARTITION BY a.pk ORDER BY b.id DESC) as seqnum_b,
             ROW_NUMBER() OVER (PARTITION BY a.pk ORDER BY c.id DESC) as seqnum_c
      FROM A INNER JOIN
           B
           ON B.fk_a = pk_a INNER JOIN
           C
           ON C.fk_a = pk_a
     ) abc
WHERE seqnum_b = 1 or seqnum_c = 1;

Actually, I think the above is on the right track, but you probably want:

SELECT a.pk, b.*, c.*
FROM A INNER JOIN
     (SELECT DISTINCT ON (b.fk_a) b.*
      FROM b
      ORDER BY b.fk_a, b.id DESC
     ) b 
     ON B.fk_a = pk_a JOIN
     (SELECT DISTINCT ON (c.fk_a) c.*
      FROM c
      ORDER BY c.fk_a, c.id DESC
     ) c
     ON c.fk_a = pk_a;

In Postgres 9.5, you can also use lateral joins for a similar effect.

Thanks. I tested the second alternative, seems okay, just submitted a minor edit to correct a few typos. — Leo Brito, Jan 27 '17 at 15:43
I think there might be some kind of typo around `b INNER JOIN ON`. — pozs, Jan 27 '17 at 16:38

score 1 · Answer 3 · answered Jan 27 '17 at 15:18

1

How about this:

SELECT DISTINCT
  A.pk AS pk_a,
  MAX(B.id) OVER(PARTITION BY pk_a) AS new_b,
  MAX(C.id) OVER(PARTITION BY pk_a) AS new_c
FROM A
INNER JOIN B ON B.fk_a = pk_a
INNER JOIN C ON C.fk_a = pk_a

answered Jan 27 '17 at 15:18

Giorgos Betsos

71,379
9
63
98

1

You may have missed *"get entire rows"* in the title and the body of the Q. – Erwin Brandstetter Jan 27 '17 at 16:50

Efficient way to get entire rows with MAX(id) when grouping by a foreign key

3 Answers3

Simpler syntax with smart naming convention