0

I have a table with a bit column named IsRetry. I want to remove all the duplicate email addresses that are not set as IsRetry. If there are duplicates of the IsRetry I want to remove those too. I tried partitioning by IsRetry, but I am still ending up with emails that are set as IsRetry and not. The SQL below is removing the duplicates, but it is keeping one of each email. Meaning, one set as IsRetry and one not. Should I run a query to remove the initial duplicates then another to remove any duplicates that are not IsRetry or can I make this into one query?

CREATE TABLE Emails
(
    [EmailAddress] varchar(50)
    ,[Md5Hash] NVARCHAR(32)
    ,[IsRetry] bit
);

insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('email1@gmail.com',  '4d35f7e71ffa6246b55f14da99e18ede', 1)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('email1@gmail.com',  '4d35f7e71ffa6246b55f14da99e18ede', 0)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('someemail@testplace.com',   'c8e5b73cbe5438c62e9ca001929cb05d', 1)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
     VALUES('someemail@testplace.com',  'c8e5b73cbe5438c62e9ca001929cb05d', 0)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('someemail@testplace.com',   'c8e5b73cbe5438c62e9ca001929cb05d', 0)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('memeTest@hotmail.com',  'ddcb2ab8af5ff8c8040defebf471d5df', 1)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('memeTest@hotmail.com',  'ddcb2ab8af5ff8c8040defebf471d5df', 1)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('memeTest@hotmail.com',  'ddcb2ab8af5ff8c8040defebf471d5df', 0)
insert into Emails (EmailAddress,Md5Hash,IsRetry) 
    VALUES('memeTest@hotmail.com',  'ddcb2ab8af5ff8c8040defebf471d5df', 0)

;WITH cte 
AS(
     SELECT ROW_NUMBER() OVER (PARTITION BY Md5Hash,IsRetry ORDER BY ( SELECT 0)) RN
            ,IsRetry
            ,EmailAddress
     FROM Emails 
)
DELETE FROM cte
WHERE  RN > 1

Select * from Emails

Here is the fiddle for it

Daniel E.
  • 2,029
  • 3
  • 22
  • 28
DDiVita
  • 4,225
  • 5
  • 63
  • 117

1 Answers1

3

It sounds like you want to leave only one email. If IsRetry is set, then you want that to be the one set. If so, then you don't want to partition by IsRetry. You want that in the in the order by clause:

with todelete as (
     select e.*,
            row_number() over (partition by EmailAddress
                               order by IsRetry desc ) as seqnum
     from Emails
    )
delete from todelete
    where seqnum > 1;

I changed the partitioning to use email. It is possible to have two different emails with the same MD5 hash code.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • That is perfect! Just curious, what are the chances of a colliding MD5 hashed email? I know it is possible given the fixed length of MD5, but never gave much thought to collision. So, thanks for the heads up! – DDiVita Aug 15 '14 at 12:00
  • @DDiVita . . . Okay, it is very, very unlikely to ever happen with random email addresses (http://stackoverflow.com/questions/8852668/what-is-the-clash-rate-for-md5), but there still seems little reason to use the hash code instead of the email address itself. – Gordon Linoff Aug 15 '14 at 12:21