
I have this table in PostgreSQL 15.3 (corresponding to a Django model):

                                             Table "public.myapp1_task"
         Column          |           Type           | Collation | Nullable |                     Default
-------------------------+--------------------------+-----------+----------+-------------------------------------------------
 id                      | bigint                   |           | not null | nextval('myapp1_task_id_seq'::regclass)
 created_at              | timestamp with time zone |           | not null |
 updated_at              | timestamp with time zone |           | not null |
 kind                    | character varying(12)    |           | not null |
 status                  | character varying(12)    |           | not null |
 environment             | character varying(7)     |           | not null |
 data                    | jsonb                    |           | not null |
 result                  | jsonb                    |           | not null |
 sent_at                 | timestamp with time zone |           |          |
 response_at             | timestamp with time zone |           |          |
 priority                | smallint                 |           | not null |
 sequence                | integer                  |           |          |
 result_attachment       | character varying(100)   |           | not null |
 taxes                   | jsonb                    |           | not null |
 myapp2_item_id          | bigint                   |           |          |
 source                  | character varying(8)     |           | not null |
 user_id                 | bigint                   |           |          |
 custom_actions          | jsonb                    |           | not null |
 
Indexes:
    "myapp1_task_pkey" PRIMARY KEY, btree (id)
    "myapp1_task_user_id_76a104e9" btree (user_id)
    "myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
    "sequence_idx" btree (sequence DESC NULLS LAST)
    "sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)

Goals: for each myapp2_item_id, find the row with the highest sequence.

I added the last two indexes related to the sequence column.

Using Django ORM, I'm trying to filter a queryset, here's the code:

from django.db.models import Max, OuterRef, Subquery

queryset = Task.objects.all()
sequences = queryset.filter(item=OuterRef("item")).exclude(sequence__isnull=True).order_by("-sequence").distinct().values("sequence")
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)

which Django translates into this SQL statement. Note the subquery with GROUP BY and MAX aggregate:

SELECT "myapp1_task"."id"
FROM "myapp1_task"
         LEFT OUTER JOIN "myapp2_item"
                         ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
         LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
         LEFT OUTER JOIN "myapp2_category"
                         ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
         LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
                                          FROM (
                                          SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
                                                FROM "myapp1_task" U0
                                                WHERE (U0."myapp2_item_id" =
                                                       ("myapp1_task"."myapp2_item_id"))
                                                GROUP BY U0."sequence"
                                                ORDER BY U0."sequence" DESC
                                                LIMIT 1) subquery)

Sadly, it's very slow on a fairly large table (>1M rows). Inspecting the EXPLAIN output, I see a sequential scan on the outer query, and neither of the new sequence indexes is used:

Seq Scan on myapp1_task  (cost=0.00..5525.25 rows=3 width=8)
  Filter: (sequence = (SubPlan 1))
  SubPlan 1
    ->  Subquery Scan on subquery  (cost=8.30..8.33 rows=1 width=4)
          ->  Limit  (cost=8.30..8.32 rows=1 width=8)
                ->  GroupAggregate  (cost=8.30..8.32 rows=1 width=8)
                      Group Key: u0.sequence
                      ->  Sort  (cost=8.30..8.31 rows=1 width=4)
                            Sort Key: u0.sequence DESC
                            ->  Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0  (cost=0.28..8.29 rows=1 width=4)
                                  Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)

Not sure what I'm doing wrong. How can this be improved?

  • 1) I haven't broken down the entire query, but this `...FROM "myapp1_task" U0 WHERE (U0."myapp2_item_id" = ("myapp1_task"."myapp2_item_id")` indicates to me that the sub-query is going to run for every row, hence the Seq Scan. 2) You should include the Django code that generated this query as an update to your question. – Adrian Klaver Jul 17 '23 at 16:24
  • It’s an empty table, no point in doing an index scan. That would be slower. If the table does have data, you might want to run analyze to update the statistics. And check auto analyze – Frank Heikens Jul 17 '23 at 18:11
  • @AdrianKlaver added, thanks. – Luke Jul 18 '23 at 07:07
  • Consider instructions for Postgres performance questions: https://stackoverflow.com/tags/postgresql-performance/info – Erwin Brandstetter Jul 18 '23 at 13:20

1 Answer


Either you or your ORM (or both) have twisted and obfuscated the SQL statement to a degree that any RDBMS would have a hard time distilling an efficient query plan from it. After removing a lot of cruft, the (equivalent) statement reads:

SELECT t.id
FROM   myapp1_task t
LEFT   JOIN myapp2_item i     ON t.myapp2_item_id = i.id
LEFT   JOIN myapp2_user iu    ON i.user_id = iu.id
LEFT   JOIN myapp2_category c ON i.myapp2_category_id = c.id
LEFT   JOIN myapp2_user tu    ON t.user_id = tu.id
WHERE  t.sequence = (
   SELECT t1.sequence
   FROM   myapp1_task t1
   WHERE  t1.myapp2_item_id = t.myapp2_item_id
   ORDER  BY t1.sequence DESC
   LIMIT  1
   );

(GROUP BY and MAX were useless noise in the original.)

The correlated subquery in the WHERE clause filters rows from myapp1_task whose sequence sorts first in descending order among rows with the same myapp2_item_id - in a very expensive way. Due to your peculiar query and table definition, rows are also eliminated where myapp2_item_id or sequence is NULL, or where any other row with the same myapp2_item_id has a NULL sequence (NULL sorts first in descending order, so the subquery returns NULL and the equality check fails).
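
For illustration, a minimal standalone snippet (an ad-hoc VALUES list, not your table) showing why a single NULL sequence per item poisons the subquery - plain ORDER BY ... DESC puts NULL first:

SELECT seq
FROM   (VALUES (1), (2), (NULL)) AS v(seq)
ORDER  BY seq DESC  -- DESC defaults to NULLS FIRST
LIMIT  1;           -- returns NULL, not 2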

All those LEFT JOINs are just noise since the SELECT list only returns myapp1_task.id anyway. The only possible effect of the joins would be to multiply result rows if a joined table had more than one matching row per task, which hardly seems possible here as each join is on an id column.

Solution

You later added:

Goals: for each myapp2_item_id, find the row with the highest sequence.

Still does not clarify how to deal with null values. Nor how to deal with duplicates. Nor what to return exactly. All of which matters.

Assuming:

  • The combination (myapp2_item_id, sequence) is actually UNIQUE.
  • You want the highest not-null sequence.
  • You want to return whole rows (SELECT *) - which is often wasteful nonsense.

Then the query boils down to just (!):

SELECT DISTINCT ON (myapp2_item_id) *
FROM   myapp1_task
ORDER  BY myapp2_item_id, sequence DESC NULLS LAST;

Index

The best index for this query is a multicolumn index with matching sort order, leading with myapp2_item_id: (myapp2_item_id, sequence DESC NULLS LAST).
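
In DDL, a minimal sketch of that index (the index name is just a placeholder):

CREATE INDEX myapp1_task_item_seq_idx
ON     myapp1_task (myapp2_item_id, sequence DESC NULLS LAST);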

For only a few rows per myapp2_item_id, the index will not help (much) - especially for SELECT *. A sequential scan can be as fast or faster. The index becomes more useful with a growing number of rows per group. For large numbers, special query techniques are superior; one is sketched below.
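
One such technique, sketched under the assumption that driving the query from myapp2_item is acceptable: fetch the top task per item with LATERAL, which can repeat a small index scan on the index above for every item (items without a qualifying task are dropped):

SELECT t.*
FROM   myapp2_item i
CROSS  JOIN LATERAL (
   SELECT *
   FROM   myapp1_task
   WHERE  myapp2_item_id = i.id
   AND    sequence IS NOT NULL
   ORDER  BY sequence DESC NULLS LAST
   LIMIT  1
   ) t;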

Postgres 16 will ship with a number of performance optimizations in this area, due late 2023.

Erwin Brandstetter