using a single query to eliminate N+1 select issue

Question

I want to return the last report of a given range of units. The last report will be identified by its time of creation. Therefore, the result would be a collection of last reports for a given range of units. I do not want to use a bunch of SELECT statements e.g.:

SELECT * FROM reports WHERE unit_id = 9999 ORDER BY time desc LIMIT 1
SELECT * FROM reports WHERE unit_id = 9998 ORDER BY time desc LIMIT 1
...

I initially tried this (but already knew it wouldn't work because it will only return 1 report):

'SELECT reports.* FROM reports INNER JOIN units ON reports.unit_id = units.id WHERE units.account_id IS NOT NULL AND units.account_id = 4 ORDER BY time desc LIMIT 1'

So I am looking for some kind of solution using subqueries or derived tables, but I can't just seem to figure out how to do it properly:

'SELECT reports.* FROM reports
WHERE id IN 
(
  SELECT id FROM reports
  INNER JOIN units ON reports.unit_id = units.id
  ORDER BY time desc
  LIMIT 1
)

Any solution to do this with subqueries or derived tables?

have you explored the [HAVING](http://www.techonthenet.com/mysql/having.php) clause? — bobkingof12vs, Mar 04 '14 at 22:02
I think this question may help - http://stackoverflow.com/questions/3800551/select-first-row-in-each-group-by-group?rq=1 as it sounds familiar. — SamV, Mar 04 '14 at 22:03

Gordon Linoff · Accepted Answer · 2014-03-05T12:30:05.860

2

The simple way to do this in Postgres uses distinct on:

select distinct on (unit_id) r.*
from reports r
order by unit_id, time desc;

This construct is specific to Postgres and databases that use its code base. It the expression distinct on (unit_id) says "I want to keep only one row for each unit_id". The row chosen is the first row encountered with that unit_id based on the order by clause.

EDIT:

Your original query would be, assuming that id increases along with the time field:

SELECT r.*
FROM reports r
WHERE id IN (SELECT max(id)
             FROM reports
             GROUP BY unit_id
            );

You might also try this as a not exists:

select r.*
from reports r
where not exists (select 1
                  from reports r2
                  where r2.unit_id = r.unit_id and
                        r2.time > r.time
                 );

I thought the distinct on would perform well. This last version (and maybe the previous) would really benefit from an index on reports(unit_id, time).

edited Mar 05 '14 at 12:30

answered Mar 04 '14 at 22:03

Gordon Linoff

1,242,037
58
646
786

Is there a way to do this with a where clause like this: "SELECT DISTINCT ON (unit_id) r.* FROM reports r WHERE reports.unit_id IN (9988,9999) ORDER BY unit_id, time DESC". This gives me something like invalid reference to FROM-clause entry for table "reports" – JohnMerlino Mar 04 '14 at 23:04
@JohnMerlino I think you just failed to use the alias; `WHERE r.unit_id IN (...)` – Craig Ringer Mar 04 '14 at 23:22
I tried using this query, but it is running incredibly slow. After around 10 minutes, I just had to cancel the load. When I was using the SELECT statements, it took long but not nearly as long (only 7 or 8 seconds). – JohnMerlino Mar 05 '14 at 00:27
Any ideas why this query runs so slow? – JohnMerlino Mar 05 '14 at 05:10

using a single query to eliminate N+1 select issue

1 Answers1