I have this table in PostgreSQL 15.3 (corresponding to a Django model):
Table "public.myapp1_task"
      Column       |           Type           | Collation | Nullable |                 Default
-------------------+--------------------------+-----------+----------+-----------------------------------------
 id                | bigint                   |           | not null | nextval('myapp1_task_id_seq'::regclass)
 created_at        | timestamp with time zone |           | not null |
 updated_at        | timestamp with time zone |           | not null |
 kind              | character varying(12)    |           | not null |
 status            | character varying(12)    |           | not null |
 environment       | character varying(7)     |           | not null |
 data              | jsonb                    |           | not null |
 result            | jsonb                    |           | not null |
 sent_at           | timestamp with time zone |           |          |
 response_at       | timestamp with time zone |           |          |
 priority          | smallint                 |           | not null |
 sequence          | integer                  |           |          |
 result_attachment | character varying(100)   |           | not null |
 taxes             | jsonb                    |           | not null |
 myapp2_item_id    | bigint                   |           |          |
 source            | character varying(8)     |           | not null |
 user_id           | bigint                   |           |          |
 custom_actions    | jsonb                    |           | not null |
Indexes:
"myapp1_task_pkey" PRIMARY KEY, btree (id)
"myapp1_task_user_id_76a104e9" btree (user_id)
"myapp1_task_myapp2_item_idd_441d91cb" btree (myapp2_item_id)
"sequence_idx" btree (sequence DESC NULLS LAST)
"sequence_mc_idx" btree (sequence, myapp2_item_id DESC NULLS LAST)
Goal: for each myapp2_item_id, find the row with the highest sequence. I added the last two indexes for the sequence column.
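To make the goal concrete, here is a plain-Python sketch of the "highest sequence per myapp2_item_id" result over a few invented rows (all ids and values are hypothetical):

```python
# Hypothetical rows: (id, myapp2_item_id, sequence).
rows = [
    (1, 10, 1), (2, 10, 3), (3, 10, 2),
    (4, 20, 5), (5, 20, 4),
    (6, 30, None),  # sequence is nullable
]

# For each item, keep the row with the highest non-NULL sequence.
best = {}
for row_id, item_id, seq in rows:
    if seq is None:
        continue  # rows without a sequence are excluded
    current = best.get(item_id)
    if current is None or seq > current[2]:
        best[item_id] = (row_id, item_id, seq)

print(sorted(best.values()))  # -> [(2, 10, 3), (4, 20, 5)]
```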
Using the Django ORM, I'm trying to filter a queryset; here's the code:

from django.db.models import Max, OuterRef, Subquery

queryset = Task.objects.all()
sequences = (
    queryset.filter(item=OuterRef("item"))
    .exclude(sequence__isnull=True)
    .order_by("-sequence")
    .distinct()
    .values("sequence")
)
max_sequences = sequences.annotate(max_seq=Max("sequence")).values("max_seq")[:1]
filtered_queryset = queryset.filter(sequence=Subquery(max_sequences))
print(filtered_queryset.query)
which the ORM translates into this SQL statement. Note the subquery with GROUP BY and MAX aggregates:
SELECT "myapp1_task"."id"
FROM "myapp1_task"
LEFT OUTER JOIN "myapp2_item"
ON ("myapp1_task"."myapp2_item_id" = "myapp2_item"."id")
LEFT OUTER JOIN "myapp2_user" ON ("myapp2_item"."user_id" = "myapp2_user"."id")
LEFT OUTER JOIN "myapp2_category"
ON ("myapp2_item"."myapp2_category_id" = "myapp2_category"."id")
LEFT OUTER JOIN "myapp2_user" T5 ON ("myapp1_task"."user_id" = T5."id")
WHERE "myapp1_task"."sequence" = (SELECT "subquery"."max_seq"
FROM (
SELECT MAX(U0."sequence") AS "max_seq", U0."sequence"
FROM "myapp1_task" U0
WHERE (U0."myapp2_item_id" =
("myapp1_task"."myapp2_item_id"))
GROUP BY U0."sequence"
ORDER BY U0."sequence" DESC
LIMIT 1) subquery)
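As an aside on what that subquery computes: because it groups by sequence itself, MAX(U0."sequence") within each group simply equals the group key, so it's really the ORDER BY ... DESC LIMIT 1 that selects the maximum. A plain-Python emulation of SubPlan 1 for a single outer row (sample values invented; NULL sequences assumed filtered out, matching the ORM's exclude):

```python
def subplan(task_rows, outer_item_id):
    """Emulate the correlated subquery for one outer row.

    task_rows: iterable of (myapp2_item_id, sequence) pairs.
    GROUP BY sequence makes MAX(sequence) equal each group's key,
    so ORDER BY sequence DESC LIMIT 1 picks the overall maximum.
    """
    seqs = {seq for item_id, seq in task_rows
            if item_id == outer_item_id and seq is not None}
    ordered = sorted(seqs, reverse=True)    # GROUP BY + ORDER BY ... DESC
    return ordered[0] if ordered else None  # LIMIT 1

rows = [(10, 1), (10, 3), (10, 2), (20, 5)]
print(subplan(rows, 10))  # -> 3
```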
Sadly, it's very slow on a fairly large table (>1M rows). Inspecting the EXPLAIN output, I see a sequential scan over the whole table with the subplan evaluated per row, and neither of the new sequence indexes is used:
Seq Scan on myapp1_task (cost=0.00..5525.25 rows=3 width=8)
Filter: (sequence = (SubPlan 1))
SubPlan 1
-> Subquery Scan on subquery (cost=8.30..8.33 rows=1 width=4)
-> Limit (cost=8.30..8.32 rows=1 width=8)
-> GroupAggregate (cost=8.30..8.32 rows=1 width=8)
Group Key: u0.sequence
-> Sort (cost=8.30..8.31 rows=1 width=4)
Sort Key: u0.sequence DESC
-> Index Scan using myapp1_task_myapp2_item_idd_441d91cb on myapp1_task u0 (cost=0.28..8.29 rows=1 width=4)
Index Cond: (myapp2_item_id = myapp1_task.myapp2_item_id)
Not sure what I'm doing wrong. How can this be improved?