0

have a question... I have a table that has over 2 billion rows. Many are duplicates but there is a column (varchar) that has a validity date in a format such as 201806.

I want to dedupe the table BUT keep the most current date.

ex.

ID,fname, lname, addrees, city, state, zip, validitydate
1,steve,smith, pob 123, miami, fl. 33081,201709
2,steve,smith, pob 123, miami, fl. 33081,201010
3,steve,smith, pob 123, miami, fl. 33081,201809
4.steve,smith, pob 123, miami, fl. 33081,201201

I only want to keep: steve,smith, pob 123, miami, fl. 33081,201809 as it is the most current. If I run the below, it dedups, but it's a crap-shoot which one is left in the table as I cannot add the validityDate as the tsql will then look as all of them as unique.

How can I make it so it dedups but calculates to keep the most current date as the final entry?

thanks in advance.

WITH Records AS 
(
SELECT fname, lname, addrees, city, 
ROW_NUMBER() OVER (
  PARTITION BY fname, lname, addrees, city, state, zip, 
    validitydate by ID) AS RecordInstance 
FROM PEOPLE where lastname like 'S%'
)
DELETE
 FROM Records 
WHERE
 RecordInstance > 1
The Impaler
  • 45,731
  • 9
  • 39
  • 76
Sal
  • 295
  • 3
  • 5
  • 13
  • Possible duplicate of [Deleting duplicate row that has earliest date](https://stackoverflow.com/questions/44730226/deleting-duplicate-row-that-has-earliest-date) – SMor Sep 19 '18 at 21:19

3 Answers3

2

Order by month (descending) so the RecordInstance will be 1 for the most current one:

WITH Records AS (
  SELECT fname, lname, addrees, city,
     ROW_NUMBER() OVER (
       PARTITION BY fname, lname, addrees, city, state, zip
       ORDER BY validitydate DESC -- Add this to order correctly!
     ) AS RecordInstance
  FROM PEOPLE where lastname like 'S%' 
)
DELETE FROM Records WHERE RecordInstance > 1
The Impaler
  • 45,731
  • 9
  • 39
  • 76
1

The delete will also work with just the ROW_NUMBER in the CTE. Which is ordered by the descending validitydate. So that the most recent month will have row_number 1 and you can delete those > 1

WITH CTE AS
(
   SELECT 
   ROW_NUMBER() OVER (PARTITION BY fname, lname, addrees, city, state, zip ORDER BY validitydate DESC, ID DESC) AS rn
   FROM PEOPLE
   WHERE lname like 'S%'
)
DELETE 
FROM CTE
WHERE rn > 1;

A test can be found here

LukStorms
  • 28,916
  • 5
  • 31
  • 45
1

Here is a link to an article I wrote regarding this very issue.

https://sqlfundamentals.wordpress.com/delete-duplicate-rows-in-t-sql/

Hope this helps.

Aubrey Love
  • 946
  • 6
  • 12