1

I have two tables in a PostgreSQL 9.5 database:

project
  - id
  - name

task
  - id
  - project_id
  - name
  - updated_at

There are ~ 1000 projects (updated very rarely) and ~ 10 million tasks (updated very often).

I want to list those 10 distinct projects that have the latest task updates.

A basic query would be:

SELECT * FROM task ORDER BY updated_at DESC LIMIT 10;

However, there can be many updated tasks per project. So I won't get 10 unique projects.

If I try to add DISTINCT(project_id) somewhere in the query, I'm getting an error:

for SELECT DISTINCT, ORDER BY expressions must appear in select list

Problem is, I can't sort (primarily) by project_id, because I need to have tasks sorted by time. Sorting by updated_at DESC, project_id ASC doesn't work either, because several tasks of the same project can be among the latest.

I can't download all records because there are millions of them.

As a workaround I download 10x needed rows (without distinct) scope, and filter them in the backend. This works for most cases, but it's obviously not reliable: sometimes I don't get 10 unique projects.

Can this be solved efficiently in Postgres 9.5?

Example

 id |   name    
----+-----------
  1 | Project 1
  2 | Project 2
  3 | Project 3

 id | project_id |  name  |   updated_at    
----+------------+--------+-----------------
  1 |          1 | Task 1 | 13:12:43.361387
  2 |          1 | Task 2 | 13:12:46.369279
  3 |          2 | Task 3 | 13:12:54.680891
  4 |          3 | Task 4 | 13:13:00.472579
  5 |          3 | Task 5 | 13:13:04.384477

If I query:

SELECT project_id, updated_at FROM task ORDER BY updated_at DESC LIMIT 2

I get:

 project_id |   updated_at    
------------+-----------------
          3 | 13:13:04.384477
          3 | 13:13:00.472579

But I want to get 2 distinct projects with the respective latest task.update_at like this:

 project_id |   updated_at    
------------+-----------------
          3 | 13:13:04.384477
          2 | 13:12:54.680891  -- from Task 3
Community
  • 1
  • 1
Michael
  • 215
  • 1
  • 6
  • Add some sample table data, and the expected result. (As formatted text.) – jarlh Sep 12 '16 at 11:11
  • 1
    `distinct` is **not** a function, it always applies to **all** columns of the select list. `DISTINCT(a), b` is the same as `distinct a,b` –  Sep 12 '16 at 11:19
  • in PostgreSQL there is syntax for **select distinct on (...)** but it still isn't a function – Paul Maxwell Sep 12 '16 at 11:26
  • 1
    The actual table definition showing data types and constraints (`CREATE TABLE` statement) is the superior way to clarify things. Looks like updated_at is a `time` column? Should really be `timestamp`? And cardinalities are essential to optimize: how many rows in each table? Frequency of updates? – Erwin Brandstetter Sep 12 '16 at 11:57
  • @Erwin: I used very simplified model for the scope of the question. In reality `updated_at` is a `timestamp with time zone`. – Michael Sep 12 '16 at 12:14
  • Projects (about 1000) are updated very rarely, tasks (about 10 million) on the other hand are updated very very often. – Michael Sep 12 '16 at 12:15

5 Answers5

2

Try a group by expression, that's what it's aimed for :

SELECT project_id, max(update_date) as max_upd_date
FROM task t
GROUP BY project_id
order by max_upd_date DESC
LIMIT 10

Do not forget to put an index that begin with : project_id, update_date if you want to avoid full table scans.

Well the only way to use the index seems to be with correlated sub query :

select p.id, 
 (select upd_dte from task t where p.id = t.prj_id order by upd_dte desc limit 1) as max_dte
from project p
order by max_dte desc
limit 10
Nemeros
  • 415
  • 2
  • 7
  • This works well, but the proposed index doesn't seem to work. I tried to use `CREATE INDEX ON task (project_id, updated_at)`. – Michael Sep 12 '16 at 11:28
  • did you made an "analyze" afterwards ? – Nemeros Sep 12 '16 at 11:30
  • I used `EXPLAIN`: QUERY PLAN ------------------------------------------------------------------------- Limit (cost=1.18..1.18 rows=2 width=12) -> Sort (cost=1.18..1.19 rows=5 width=12) Sort Key: (max(updated_at)) DESC -> HashAggregate (cost=1.07..1.12 rows=5 width=12) Group Key: project_id -> Seq Scan on task t (cost=0.00..1.05 rows=5 width=12) (6 rows) – Michael Sep 12 '16 at 11:31
  • you are right. the only way is to use correlated subquery, i updated my solution. Dunno which one is better compared to Erwin one's – Nemeros Sep 12 '16 at 18:06
2

The simple (logically correct) solution is to aggregate tasks to get the latest update per project, and then pick the latest 10, like @Nemeros provided.

However, this incurs a sequential scan on task, which is undesirable (expensive) for big tables.

If you have relatively few projects (many task entries per project), there are faster alternatives using (bitmap) index scans.

SELECT *
FROM   project p
     , LATERAL (
   SELECT updated_at AS last_updated_at
   FROM   task
   WHERE  project_id = p.id
   ORDER  BY updated_at DESC
   LIMIT  1
   ) t
ORDER  BY t.last_updated_at
LIMIT  10;

Key to performance is a matching multicolumn index:

CREATE INDEX task_project_id_updated_at ON task (project_id, updated_at DESC);

A setup with 1000 projects and 10 million tasks (like you commented) is a perfect candidate for this.

Background:

NULL and "no row"

Above solution assumes updated_at is defined NOT NULL. Else use ORDER BY updated_at DESCNULLS LAST and ideally make the index match.

Projects without any tasks are eliminated from the result by the implicit CROSS JOIN. NULL values cannot creep in this way. This is subtly different from correlated subqueries like @Nemeros added to his answer: those return NULL values for "no row" (project has no related tasks at all). The outer descending sort order then lists NULL on top unless instructed otherwise. Most probably not what you want.

Related:

Community
  • 1
  • 1
Erwin Brandstetter
  • 605,456
  • 145
  • 1,078
  • 1,228
  • 1
    Thanks. I was worried about not using correct indexes, but it seems that they were simply not used on small dev dataset. It works fine on production and is much faster than the `GROUP BY` version. – Michael Sep 12 '16 at 16:28
  • @Erwin, might you explain the principal difference between your solution and the updated one i putted ? the plan are different but the logic behind seems kinda the same. – Nemeros Sep 12 '16 at 18:08
  • 1
    @Nemeros: I added some more explanation to the answer. For more, consider the comparison of both techniques (and more) [in the linked answer above](http://stackoverflow.com/questions/25536422/optimize-group-by-query-to-retrieve-latest-record-per-user/25536748#25536748):. – Erwin Brandstetter Sep 12 '16 at 23:11
1

try to use

SELECT project_id, 
       Max (updated_at) 
FROM   task 
GROUP  BY project_id 
ORDER  BY Max(updated_at) DESC 
LIMIT  10 
Christian
  • 827
  • 6
  • 14
0

How about sorting the records by the most recent update and then doing distinct on?

select distinct on (t.project_id) t.*
from tasks t
order by max(t.update_date) over (partition by t.project_id), t.project_id;

EDIT:

I didn't realize Postgres did that check. Here is the version with a subquery:

select distinct on (maxud, t.project_id) t.*
from (select t.*,
             max(t.update_date) over (partition by t.project_id) as maxud
      from tasks t
     ) t
order by maxud, t.project_id;

You could probably put the analytic call in the distinct on, but I think this is clearer anyway.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Doesn't work unfortunately: `ERROR: SELECT DISTINCT ON expressions must match initial ORDER BY expressions` – Michael Sep 12 '16 at 11:19
0

I believe row_number() over() can be used for this but you will still need the final order by and limit clauses:

select
   mt.*
from (
     SELECT
          * , row_number() over(partition by project_id order by updated_at DESC) rn
     FROM tasks 
     ) mt
-- inner join Projects p on mt.project_id = p.id
where mt.rn = 1
order by mt.updated_at DESC
limit 2

Advantage of this approach gives you access to the full row corresponding to the maximum updated_at for each project. You can optionally join the projects table as well

result:

| id | project_id |   name |      updated_at | rn |
|----|------------|--------|-----------------|----|
|  5 |          3 | Task 5 | 13:13:04.384477 |  1 |
|  3 |          2 | Task 3 | 13:12:54.680891 |  1 |

see: http://sqlfiddle.com/#!15/ee039/1

Paul Maxwell
  • 33,002
  • 3
  • 32
  • 51