have a question... I have a table that has over 2 billion rows. Many are duplicates but there is a column (varchar) that has a validity date in a format such as 201806.
I want to dedupe the table BUT keep the most current date.
ex.
ID,fname, lname, addrees, city, state, zip, validitydate
1,steve,smith, pob 123, miami, fl. 33081,201709
2,steve,smith, pob 123, miami, fl. 33081,201010
3,steve,smith, pob 123, miami, fl. 33081,201809
4.steve,smith, pob 123, miami, fl. 33081,201201
I only want to keep: steve,smith, pob 123, miami, fl. 33081,201809 as it is the most current. If I run the below, it dedups, but it's a crap-shoot which one is left in the table as I cannot add the validityDate as the tsql will then look as all of them as unique.
How can I make it so it dedups but calculates to keep the most current date as the final entry?
thanks in advance.
WITH Records AS
(
SELECT fname, lname, addrees, city,
ROW_NUMBER() OVER (
PARTITION BY fname, lname, addrees, city, state, zip,
validitydate by ID) AS RecordInstance
FROM PEOPLE where lastname like 'S%'
)
DELETE
FROM Records
WHERE
RecordInstance > 1