
I am facing a strange and infuriating problem: my PostgreSQL 10 database is not using indexes in specific situations. I have created an index on the md5 hash of one of the columns, and this index seems to work fine under normal circumstances:

EXPLAIN
SELECT * FROM variants_table WHERE md5(variant_id) = md5('1_9746683_A_G');

QUERY PLAN
Index Scan using idx_variant_id_md5 on variants_table  (cost=0.71..50470.36 rows=12473 width=748)
   Index Cond: (md5(variant_id) = '00fa85f0ea97a4aa2898f838f427e560'::text)
(2 rows)

However, when I try to delete duplicates using this query:

EXPLAIN
WITH Ranked AS (
   SELECT
     md5(variant_id) as variant_id_md5,
     consscore,
     ROW_NUMBER() OVER (PARTITION BY md5(variant_id) ORDER BY consscore DESC) as rn
   FROM
     variants_table
 )
DELETE FROM variants_table
WHERE (md5(variant_id), consscore) IN (
   SELECT variant_id_md5, consscore
   FROM Ranked
   WHERE rn > 1
 );

QUERY PLAN
 Delete on variants_table  (cost=3647060725.61..4520666023.49 rows=2949470436 width=70)
   CTE ranked
     ->  WindowAgg  (cost=3096491910.93..3361944250.15 rows=11797881743 width=48)
           ->  Sort  (cost=3096491910.93..3125986615.29 rows=11797881743 width=40)
                 Sort Key: (md5(variants_table_1.variant_id)), variants_table_1.consscore DESC
                 ->  Seq Scan on variants_table variants_table_1  (cost=0.00..800237220.79 rows=11797881743 width=40)
   ->  Hash Join  (cost=285116475.46..1158721773.34 rows=2949470436 width=70)
         Hash Cond: ((md5(variants_table.variant_id) = ranked.variant_id_md5) AND (variants_table.consscore = ranked.consscore))
         ->  Seq Scan on variants_table  (cost=0.00..770742516.43 rows=11797881743 width=46)
         ->  Hash  (cost=285115875.46..285115875.46 rows=40000 width=104)
               ->  HashAggregate  (cost=285115475.46..285115875.46 rows=40000 width=104)
                     Group Key: ranked.variant_id_md5, ranked.consscore
                     ->  CTE Scan on ranked  (cost=0.00..265452339.22 rows=3932627248 width=104)
                           Filter: (rn > 1)
(14 rows)

It defaults to a sequential scan.

I have tried everything to debug this, including setting enable_seqscan = off, but no matter what, the query planner still uses a sequential scan, which leads me to assume that it doesn't know how to use the index here.
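
For example, setting it for the session like this (enable_seqscan = off only penalizes sequential scans in the planner's cost model, it does not forbid them) still yields the same plan:

SET enable_seqscan = off;   -- discourages seq scans via a huge cost penalty, does not disable them
-- re-running the EXPLAIN of the DELETE from above still shows the same Seq Scan nodes
RESET enable_seqscan;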

  • Could you please share the results from `EXPLAIN(ANALYZE, VERBOSE, BUFFERS, SETTINGS)` for these statements, the DDL for all tables and indexes involved, and everything in plain text. By the way, version 10 is dead and buried, I would upgrade to a recent version asap. It might solve your problem as well. – Frank Heikens Aug 09 '23 at 20:34
  • Your first and your second query are also very different from each other; the only thing they have in common is that they use the same table. The outdated and unsupported version 10 can't optimize the CTE either. – Frank Heikens Aug 09 '23 at 20:46
  • I have upgraded to version 14 and ran the following: `EXPLAIN(ANALYZE, VERBOSE, BUFFERS) SELECT * FROM variants_table WHERE md5(variant_id) = md5('1_9746683_A_G'); Index Scan using idx_variant_id_md5 on public.variants_table (cost=0.71..50470.36 rows=12473 width=748) (actual time=0.051..0.062 rows=3 loops=1) Index Cond: (md5(variants_table.variant_id) = '00fa85f0ea97a4aa2898f838f427e560'::text) Buffers: shared hit=7 Planning time: 0.309 ms Execution time: 0.163 ms (6 rows)` – Thomas Li Aug 09 '23 at 23:35
  • I can't run this on any of the other queries because EXPLAIN(ANALYZE) takes way too long. – Thomas Li Aug 09 '23 at 23:35

1 Answer


Problem

The window function row_number() has to process the whole table, so it doesn't pay to use your index to begin with. Hence the plan falls back to a sequential scan.

Better query

Postgres 15 brought some optimizations for row_number(), and the upcoming Postgres 16 ships with more, but I am not sure how much your DELETE query can profit.

Be that as it may, this simpler query should perform better in any case:

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND    v.consscore > d.consscore
   );

Read:
"Delete rows where another row with the same md5(variant_id) and a greater consscore exists in the same table!"

If there can be complete duplicates on (md5(variant_id), consscore), this query preserves all dupes with the highest consscore - unlike your original query. But your original query keeps an arbitrary winner, which is typically bad design, and you should really add a deterministic tiebreaker in this case - which you can use in my query accordingly.
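
For example, assuming a unique id column as deterministic tiebreaker (a hypothetical column name, adapt to your actual key):

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND    (v.consscore, v.id) > (d.consscore, d.id)   -- id breaks ties among equal consscore
   );

This keeps exactly one row per md5(variant_id): the one with the highest consscore, and the highest id among ties.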

If an arbitrary pick is good enough, you can use the system column ctid in a row-value comparison for this:

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND    (v.consscore, v.ctid) > (d.consscore, d.ctid)
   );

About row-value comparison:
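
For non-null values, the row-value comparison is just shorthand for the explicit form, i.e. the ctid variant above is equivalent to:

DELETE FROM variants_table d
WHERE  EXISTS (
   SELECT FROM variants_table v
   WHERE  md5(v.variant_id) = md5(d.variant_id)
   AND   (v.consscore > d.consscore
          OR (v.consscore = d.consscore AND v.ctid > d.ctid))
   );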

Note that my direct comparison works differently for null values than your original. Typically, you don't want to use my query if variant_id or consscore can be null. See:

If null values can be involved, you'll have to define the desired behavior exactly.

If this is going to delete the majority of rows, other strategies may be more efficient. Like copying the few survivors into a pristine new table. See:
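
A rough sketch of that route (it assumes a maintenance window with no concurrent writes; indexes are recreated after the bulk load):

BEGIN;
CREATE TABLE variants_table_new (LIKE variants_table INCLUDING DEFAULTS INCLUDING CONSTRAINTS);

INSERT INTO variants_table_new
SELECT DISTINCT ON (md5(variant_id)) *
FROM   variants_table
ORDER  BY md5(variant_id), consscore DESC;  -- one arbitrary row with the highest consscore per hash

DROP TABLE variants_table;
ALTER TABLE variants_table_new RENAME TO variants_table;
COMMIT;
-- then recreate indexes, including the expression index on md5(variant_id)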

But I'd assume you are hunting for relatively few duplicates. Consider making that expression index UNIQUE after you have cleaned up, to prevent more dupes from creeping in. If possible.
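
For example (index name assumed; the existing non-unique expression index becomes redundant and can be dropped):

CREATE UNIQUE INDEX variants_table_variant_id_md5_uni ON variants_table (md5(variant_id));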

Aside

Your query should be faster still after optimizing that expression index to use the data type uuid instead of text. See:
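
A sketch of such an index (name assumed): md5() output is valid uuid input, so the 32-character hex text casts to uuid directly, and the indexed value shrinks to 16 bytes. Queries must then compare on the same expression:

CREATE INDEX variants_table_variant_id_md5_uuid_idx
    ON variants_table ((md5(variant_id)::uuid));

SELECT *
FROM   variants_table
WHERE  md5(variant_id)::uuid = md5('1_9746683_A_G')::uuid;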

Or maybe a bigint hash is good enough? See:

Erwin Brandstetter