Most efficient way to use aggregate functions in table join?

Question

I'm attempting to join two tables where all the records in table one have unique ids and table two can contain multiple records for an id in table one (ids in table one may not have any corresponding records in table two).

I would like to join only the most recent record in table two.

SELECT id, MAX(date) AS DATE FROM table_two

Above returns the most recent records for a particular id.

SELECT * FROM table_one t1 LEFT JOIN table_two t2 ON t1.id = t2.id

Above returns the joined table but returns the duplicates as well.

How can I construct the join statement to return only the most recent records in table two?

Something like this:

SELECT * FROM table_one t1 LEFT JOIN table_two t2 ON t1.id = t2.id, MAX(t2.date) AS date GROUP BY date

Above returns an error: ERROR: aggregate functions are not allowed in functions in FROM

I know I can use a subquery to get the most recent record but not sure of the most efficient way to do this.

What version of postgresql? (Determins if lateral is an option or if the row_Number method should be used or if we just need to subquery. — xQbert, Jul 24 '17 at 18:58

sagar gholap · Answer 1 · 2017-07-25T12:26:38.203

2

You have to make use of subquery. Here are two possible answers:

Using row_number and subquery

select * from(
    SELECT t1.*,row_number() over(partition by t2.id order by t2.date desc) as row_num
    FROM table_one t1
    LEFT JOIN table_two t2 ON t1.id = t2.id
)t
where t.row_num = 1

Using max and subquery

SELECT * FROM table_one t1 
LEFT JOIN 
(
    SELECT id, MAX(date) AS DATE FROM table_two group by id
)t2
ON t1.id = t2.id

edited Jul 25 '17 at 12:26

answered Jul 24 '17 at 18:50

sagar gholap

21
5

the 2nd approach here doesn't get all the column information from t2; just the max date. may need to do a self join so all columns can be returned. – xQbert Jul 24 '17 at 19:57
@sagar The first query doesn't seem to return all of the information from t1, the second query requires a group by in order to use the aggregate function? – Don Jul 24 '17 at 20:43
yes right @DonPeat, group by id needed in second query. For first query (t1.*) should return all the columns – sagar gholap Jul 25 '17 at 12:28

xQbert · Accepted Answer · 2017-07-24T19:22:30.273

1

Both of these are untested; I'll get it working if a rextester.com test case is setup.

Postgresql 9.2? and up, I think, implements an outer apply using the lateral keyword. This basically runs a select against table 1 and then runs the select from table 2 for each record in table 1; returning from table 2 the newest dated record for the respective ID; but without having to do so in a loop the engine can somehow handle the join as a "SET" and thus doesn't suffer from a loop/row-by-row performance impact.

SELECT * 
FROM table_one t1 
LEFT JOIN LATERAL (SELECT A.*
                   FROM table_two A
                   WHERE A.ID = T1.ID
                   ORDER BY A.Date desc
                   LIMIT 1) t2 
  ON TRUE

Grouped LIMIT in PostgreSQL: show the first N rows for each group? for other examples.

--Should work on many prior versions since its not using analytical functions or lateral.

SELECT T1.*, t2.*
FROM table_one t1 
LEFT JOIN (SELECT A.* 
           FROM table_two A
           INNER JOIN (SELECT MAX(date) AS date, ID 
                       FROM table_two
                       GROUP BY ID) B
              on B.ID = A.ID
             and B.Date = A.Date) t2
 ON t1.id = t2.id

basic concepts of the above... get the max date and ID for a record in table 2 join this back to the base set of table2 to get all the details of a record in table2 having the max date per ID. then join to table 1 to get t1 and t2 values.

edited Jul 24 '17 at 19:22

answered Jul 24 '17 at 18:50

xQbert

34,733
2
41
62

Both queries work, I haven't used `lateral` before. Is there a significant performance advantage by using one over the other (vs. a more simplified subquery)? – Don Jul 24 '17 at 19:19
1

Lateral is the newer syntax and should be more efficient as it doesn't have to generate the sub query results allowing the engine to generate a more optimal execution plan. I would recommend running the explain plan for both and analyze the results. I think you'll see lateral is more efficient; but that always depends on indexes, hardware, and data volume (table statistics) that the Cost Based Optimizer (CBO) uses. – xQbert Jul 24 '17 at 19:21
1

Put another way 10 years ago you'd have had to use the 2nd query. it's about the only way you could do what you wanted in Pure SQL. 5 years ago and still today you could use a row_number logic to get the earliest date per record and then join.. today, LATERAL (CROSS/OUTER APPLY in other engines) are used to return these situations when you have a 1-N relationship and want to return x rows from the N side. the DB Manufacturers deemed this CROSS Apply approach beneficial to handle these 1-N with x rows from inside. if they built it, it must be equal to or better than what was done before. – xQbert Jul 24 '17 at 19:31
yep, running explain on the methods proposed here has lateral winning by a landslide. Thanks for the information! – Don Jul 24 '17 at 19:47
@DonPeat take note of sagar's use of row_number prior to lateral, it ***was*** the most efficient. it's the 3rd way I didn't bother to spell out as they already had. – xQbert Jul 24 '17 at 19:56
sagar's queries don't return the correct results though I'm sure i could tweak them to work. The run time on row_number for this specific query is still not as fast as the group by query above. – Don Jul 24 '17 at 20:42

Most efficient way to use aggregate functions in table join?

2 Answers2