
I have a table called PF_temp that has the following structure:

  • firstname
  • middlename
  • lastname
  • DOB
  • address
  • city
  • state
  • phone
  • validitydate

It has many rows that are identical, except for the validity date. For example:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201609
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201002
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706
steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,199812

I'd like to run a script that deletes all duplicates matched on everything but the last column (validitydate), leaving just the row below in the table, which has the most recent validitydate of 201706:

steve,s,smith, 19710909,112 crazy st,miami,fl,3055551212,201706

This is what I have, though it throws an exception:

DELETE 
FROM PF_temp
LEFT OUTER JOIN 
(
    SELECT Min(ValidityDate) as RowId
    , firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    FROM PF_temp 
    GROUP BY firstname
    , middlename
    , lastname
    , DOB
    , address
    , city
    , state 
    , phone
    , validitydate 
) as KeepRows 
ON TableName.RowId = KeepRows.RowId
WHERE KeepRows.RowId IS NULL

It doesn't work and actually produces these errors:

Msg 156, Level 15, State 1, Line 2
Incorrect syntax near the keyword 'OUTER'.
Msg 156, Level 15, State 1, Line 4
Incorrect syntax near the keyword 'as'.

Also, I'd like to run it in stages based on the first letter of the last name. So something like `WHERE lastname like 'A%'` needs to be added somewhere.

  • I think this has the answer you need: https://stackoverflow.com/questions/17548751/how-to-write-a-sql-delete-statement-with-a-select-statement-in-the-where-clause – Gratus D. Jun 23 '17 at 21:36
  • I do actually have a primary key but everything I try still blows up. – Sal Jun 23 '17 at 21:44

1 Answer


Try this:

delete a
from PF_Temp a
inner join PF_Temp b 
on  b.firstname = a.firstname 
and b.middlename = a.middlename
and b.lastname = a.lastname
and b.DOB = a.DOB
and b.address = a.address
and b.city = a.city
and b.state = a.state
and b.phone = a.phone
and b.validitydate > a.validitydate

Example at SQL Fiddle.

The above works by:

  • joining on all matching fields (except validitydate), which captures in `a` every record that has a duplicate. At this stage we capture all records, since each record in `a` matches itself in `b`.
  • By specifying that the validitydate in `b` must be greater than that in `a`, we both avoid the self-match issue (if it were the same record, the validitydate would be the same) and ensure there is no match when the record in `a` is the most recent, since no record in `b` has a greater validitydate.
  • we then delete every record returned in `a`; i.e. every record which has a duplicate with a later validitydate.

If you want to delete only those duplicates with a specific last name, do exactly what you said above; i.e. add the line `where a.lastname like 'A%'`.
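To make the staged runs concrete, here's the full statement with that filter in place (the `a.` prefix matters: `lastname` exists in both aliases, so an unqualified name raises an ambiguous-column error):

```sql
delete a
from PF_Temp a
inner join PF_Temp b
on  b.firstname = a.firstname
and b.middlename = a.middlename
and b.lastname = a.lastname
and b.DOB = a.DOB
and b.address = a.address
and b.city = a.city
and b.state = a.state
and b.phone = a.phone
and b.validitydate > a.validitydate
where a.lastname like 'A%'   -- limit this pass to last names starting with A
```

Repeat the run with `'B%'`, `'C%'`, etc. to work through the table in stages.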


Update

You mention that some columns may contain nulls. Here's a revised version of the above to take into account that `null` never equals `null` in a comparison.

delete a
from PF_Temp a
inner join PF_Temp b 
on  ((b.firstname = a.firstname) or (b.firstname is null and a.firstname is null))
and ((b.middlename = a.middlename) or (b.middlename is null and a.middlename is null))
and ((b.lastname = a.lastname) or (b.lastname is null and a.lastname is null))
and ((b.DOB = a.DOB) or (b.DOB is null and a.DOB is null))
and ((b.address = a.address) or (b.address is null and a.address is null))
and ((b.city = a.city) or (b.city is null and a.city is null))
and ((b.state = a.state) or (b.state is null and a.state is null))
and ((b.phone = a.phone) or (b.phone is null and a.phone is null))
and b.validitydate > a.validitydate

An alternative to the above would be `on coalesce(b.firstname,'') = coalesce(a.firstname,'')` (repeating that pattern for all the other matching fields); though that would mean that nulls and blanks were treated the same, and it wouldn't perform quite so well.
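For illustration, that coalesce pattern applied to the whole join would look like the sketch below. It assumes all the matched columns are character types (as in your sample data) and that an empty string never occurs as a real value; if it does, a blank and a null would wrongly be treated as equal:

```sql
delete a
from PF_Temp a
inner join PF_Temp b
on  coalesce(b.firstname, '')  = coalesce(a.firstname, '')
and coalesce(b.middlename, '') = coalesce(a.middlename, '')
and coalesce(b.lastname, '')   = coalesce(a.lastname, '')
and coalesce(b.DOB, '')        = coalesce(a.DOB, '')
and coalesce(b.address, '')    = coalesce(a.address, '')
and coalesce(b.city, '')       = coalesce(a.city, '')
and coalesce(b.state, '')      = coalesce(a.state, '')
and coalesce(b.phone, '')      = coalesce(a.phone, '')
and b.validitydate > a.validitydate   -- still excludes the self-match and the most recent row
```

The explicit `is null`-pair version above is the safer of the two, since it never conflates blanks with nulls.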


Alternative Method

A different approach, which is more forgiving of nulls, is to use a subquery to pull back all values, numbering each set with matching values, starting at 1 for the most recent validity date. We then delete all those rows which came back with numbers higher than 1; i.e. any which are duplicates with earlier validity dates.

delete TheDeletables
from 
(
    select *
    , row_number() over (
        partition by 
         firstname 
        , middlename 
        , lastname 
        , DOB 
        , address  
        , city 
        , state 
        , phone 
        order by validitydate desc
    ) rowid
    from PF_Temp
) TheDeletables
where rowid > 1;

Demo SQL Fiddle.
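Before running either delete for real, you can preview exactly which rows would go by swapping the `delete` for a `select`; for the alternative method that looks like this:

```sql
select *
from
(
    select *
    , row_number() over (
        partition by firstname, middlename, lastname, DOB, address, city, state, phone
        order by validitydate desc
    ) rowid
    from PF_Temp
) TheDeletables
where rowid > 1;   -- these are the rows the delete would remove
```

Any row listed here is a duplicate with an earlier validitydate; the row numbered 1 in each group (the most recent) is kept.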

  • For some reason this does not delete anything... 0 rows affected when there should be dups deleted – Sal Jun 23 '17 at 22:31
  • Do any of the fields for your rows with duplicates contain `null`s? Are the values in the duplicates exactly the same, or could there be leading/trailing spaces? FYI: You can see what the script would delete by replacing `delete a` with `select distinct a.*` if you want to test out some scenarios without risking incorrect data loss. – JohnLBevan Jun 23 '17 at 22:39
  • ahh.. yes some do have nulls.. actually all in my test file do and in the real file they do exist too. – Sal Jun 23 '17 at 22:40
  • Try the amended version; hopefully it makes sense? – JohnLBevan Jun 23 '17 at 22:47
  • that seemed to work, perfect! But when I add the where clause to work on last name, I get this: Msg 209, Level 16, State 1, Line 13 Ambiguous column name 'lastname'. – Sal Jun 23 '17 at 22:53
  • Include the `a.` before it; otherwise SQL doesn't know if you want to match on `a` or `b` (I know they're the same given the `on` clause; but SQL's not that clever). – JohnLBevan Jun 23 '17 at 22:57
  • Also, lets say there are dups of the greatest date... the script seems to not delete the duplicates of the greatest validity date – Sal Jun 23 '17 at 22:58
  • scratch that. The Alternative Method throws those out too. This works perfectly. You are the MAN!!!!! thank you!! – Sal Jun 23 '17 at 23:01
  • No worries :). Agreed; the first one assumes that the records are unique when all fields are taken into account (i.e. including validity date), so I didn't prime it for that scenario / in that case the only option is the alternate version, or to include some form of identity field / some other disambiguator. – JohnLBevan Jun 23 '17 at 23:05
  • Thank you again. You saved me bigly! – Sal Jun 23 '17 at 23:07