I have this very simple query, generated by my ORM (Entity Framework Core):
SELECT *
FROM "table1" AS "t1"
WHERE EXISTS (
SELECT 1
FROM "table2" AS "t2"
WHERE ("t2"."is_active" = TRUE) AND ("t1"."table2_id" = "t2"."id"))
ORDER BY "t1"."table2_id"
- There are 2 "is_active" records. The other involved columns ("id") are the primary keys. Query returns exactly 4 rows.
- Table 1 is 96 million records.
- Table 2 is 30 million records.
- The 3 columns involved in this query are indexed (is_active, id, table2_id).
- The C#/LINQ code that generates this simple query is: Table2.Where(t => t.IsActive).Include(t => t.Table1).ToList();`
SET STATISTICS 10000
was set to all of the 3 columns.VACUUM FULL ANALYZE
was run on both tables.
WITHOUT the ORDER BY
clause, the query returns within a few milliseconds, and I’d expect nothing else for 4 records to return. EXPLAIN output:
Nested Loop (cost=1.13..13.42 rows=103961024 width=121)
-> Index Scan using table2_is_active_idx on table2 (cost=0.56..4.58 rows=1 width=8)
Index Cond: (is_active = true)
Filter: is_active
-> Index Scan using table1_table2_id_fkey on table1 t1 (cost=0.57..8.74 rows=10 width=121)
Index Cond: (table2_id = table1.id)
WITH the ORDER BY
clause, the query takes 5 minutes to complete! EXPLAIN output:
Merge Semi Join (cost=10.95..4822984.67 rows=103961040 width=121)
Merge Cond: (t1.table2_id = t2.id)
-> Index Scan using table1_table2_id_fkey on table1 t1 (cost=0.57..4563070.61 rows=103961040 width=121)
-> Sort (cost=4.59..4.59 rows=2 width=8)
Sort Key: t2.id
-> Index Scan using table2_is_active_idx on table2 a (cost=0.56..4.58 rows=2 width=8)
Index Cond: (is_active = true)
Filter: is_active
The inner, first index scan should return no more than 2 rows. Then the outer, second index scan doesn't make any sense with its cost of 4563070 and 103961040 rows. It only has to match 2 rows in table2
with 4 rows in table1
!
This is a very simple query with very few records to return. Why is Postgres failing to execute it properly?