0

My table consists of duplicate email addresses. Each email address has a unique create date and a unique ID. I want to identify the email address with the most recent create date and its associated ID and show the duplicate ID with its create date as well. I would like the query to show this in the following format:

  • Column 1: EmailAddress
  • Column 2: IDKeep
  • Column 3: CreateDateofIDKeep
  • Column 4: DuplicateID
  • Column 5: CreateDateofDuplicateID

Note: There are instances where more than 2 duplicate email addresses exist. I would like the query to show each additional duplicate on a new row, re-stating the EmailAddress and IDKeep in these instances.

To no avail I've attempted to piece together different queries found on here. I'm currently at a loss--any help/direction would be greatly appreciated.

sqlbg
  • 73
  • 1
  • 11

2 Answers2

1

Complicated queries are best solved by breaking it up into pieces and working step-by-step.

First let's create a query to find the key of the row we want to keep, by finding the most recent create date for each email then joining to get the Id:

select x.Email, x.CreateDate, x.Id
from myTable x
join (
    select Email, max(CreateDate) as CreateDate
    from myTable
    group by Email
) y on x.Email = y.Email and x.CreateDate = y.CreateDate

Ok, now let's make a query to get duplicate email addresses:

select Email
from myTable
group by Email
having count(*) > 1

And join this query back to the table to get the keys for every row that has duplicates:

select x.Email, x.Id, x.CreateDate
from myTable x
join (
    select Email
    from myTable
    group by Email
    having count(*) > 1
) y on x.Email = y.Email

Great. Now all that is left is to join the first query with this one to get our result:

select keep.Email, keep.Id as IdKeep, keep.CreateDate as CreateDateOfIdKeep,
    dup.Id as DuplicateId, dup.CreateDate as CreateDateOfDuplicateId
from (
    select x.Email, x.CreateDate, x.Id
    from myTable x
    join (
        select Email, max(CreateDate) as CreateDate
        from myTable
        group by Email
    ) y on x.Email = y.Email and x.CreateDate = y.CreateDate
) keep
join (
    select x.Email, x.Id, x.CreateDate
    from myTable x
    join (
        select Email
        from myTable
        group by Email
        having count(*) > 1
    ) y on x.Email = y.Email
) dup on keep.Email = dup.Email and keep.Id <> dup.Id

Note the final keep.Id <> dup.Id predicate on the join ensures we don't get the same row for both keep and dup.

lc.
  • 113,939
  • 20
  • 158
  • 187
  • This is amazing and I think gives me exactly what I'm looking for. Only question is where can I insert a WHERE statement to remove instances where 'Email' is NULL? – sqlbg Apr 02 '15 at 02:05
  • Wherever you want really, although in fact this does not handle the NULL case and nulls will not be present from the result set (`keep.Email = dup.Email` will be NULL if either side is NULL). I assumed the email field was not null. – lc. Apr 02 '15 at 04:47
0

The following subquery uses a trick to get the latest id and creation date for each email:

select Email, max(CreateDate) as CreateDate,
       substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
from myTable
group by Email
having count(*) > 1;

The having() clause also ensures that this is just for duplicated emails.

Then, this query just needs to be combined with the rest of the data to get the format that you want:

select t.Email, tkeep.id as keep_id, tkeep.CreateDate as keep_date,
       id as dup_id, CreateDate as dup_CreateDate
from myTable t join
     (select Email, max(CreateDate) as CreateDate,
             substring_index(group_concat(id order by CreateDate desc), ',', 1) as id
      from myTable
      group by Email
      having count(*) > 1
     ) tkeep
     on t.Email = tkeep.Email and t.CreateDate <> tkeep.CreateDate;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786