How to break ties when comparing columns in SQL

Question

I am trying to delete duplicates in Postgres. I am using this as the base of my query:

DELETE FROM case_file as p
WHERE EXISTS (
    SELECT FROM case_file as p1
    WHERE p1.serial_no = p.serial_no
    AND p1.cfh_status_dt < p.cfh_status_dt
    );

It works well, except that when the dates cfh_status_dt are equal then neither of the records are removed.

For rows that have the same serial_no and the date is the same, I would like to keep the one that has a registration_no (if any do, this column also has NULLS).

Is there a way I can do this with all one query, possibly with a case statement or another simple comparison?

`I would like to keep the one that has a registration_no (if any do, this column also has NULLS).` So can there be more than one among ties with `registration_no IS NOT NULL`? — Erwin Brandstetter, Jul 20 '20 at 23:56
And what to do if registration_no is null for all rows? Keep the one and delete others? — User9123, Jul 20 '20 at 23:57
Also, is there a PK or any UNIQUE (combination of) column(s). An exact table definition (`CRATE TABLE` statement) showing data types and constraints always helps to clarify — Erwin Brandstetter, Jul 21 '20 at 00:00

Erwin Brandstetter · Accepted Answer · 2020-07-21T00:19:18.977

DELETE FROM case_file AS p
WHERE  id NOT IN (
   SELECT DISTINCT ON (serial_no) id  -- id = PK
   FROM   case_file 
   ORDER  BY serial_no, cfh_status_dt DESC, registration_no
   );

This keeps the (one) latest row per serial_no, choosing the smallest registration_no if there are multiple candidates.

NULL sorts last in default ascending order. So any row with a not-null registration_no is preferred.

If you want the greatest registration_no instead, to still sort NULL values last, use:

   ...
   ORDER  BY serial_no, cfh_status_dt DESC, registration_no DESC NULLS LAST

See:

If you have no PK (PRIMARY KEY) or other UNIQUE NOT NULL (combination of) column(s) you can use for this purpose, you can fall back to ctid. See:

How do I (or can I) SELECT DISTINCT on multiple columns?

NOT IN is typically not the most efficient way. But this deals with duplicates involving NULL values. See:

How to delete duplicate rows without unique identifier

If there are many duplicates - and you can afford to do so! - it can be (much) more efficient to create a new, pristine table of survivors and replace the old table, instead of deleting the majority of rows in the existing table.

Or create a temporary table of survivors, truncate the old and insert from the temp table. This way depending objects like views or FK constraints can stay in place. See:

How to delete duplicate entries?

Surviving rows are simply:

SELECT DISTINCT ON (serial_no) *
FROM   case_file 
ORDER  BY serial_no, cfh_status_dt DESC, registration_no;

How to break ties when comparing columns in SQL

1 Answers1

Linked