2

I have a peculiar dupes problem. I can identify the records easily enough, but I need to do what's essentially a merge on some accompanying data.

Here's the problem. My table is a bit like this:

CREATE TABLE `People` (
  `PersonId` int(11) NOT NULL AUTO_INCREMENT,
  `Address` varchar(255) DEFAULT NULL,
  `Title` varchar(50) DEFAULT NULL,
  `Forename` varchar(150) DEFAULT NULL,
  `Surname` varchar(150) DEFAULT NULL,
  `FlagOne` bit(1) NOT NULL DEFAULT b'0',
  `FlagTwo` bit(1) NOT NULL DEFAULT b'0',
  `FlagThree` bit(1) NOT NULL DEFAULT b'0',
  PRIMARY KEY (`PersonId`)
)

The duplicate records differ only in title and flag values - they are identified as duplicates by having identical address, forename and surname fields:

PersonId    Address         Title   Forename    Surname FlagOne FlagTwo FlagThree
1           6 Smith Street  Mrs     Jane        Doe     1       0       0
2           6 Smith Street  Ms      Jane        Doe     0       1       0

What I can't figure out how to do is amalgamate these two into a single record which retains all the positive flags. It doesn't matter which of the two original records is retained - using PersonId to differentiate them is fine. So, something like this is the desired result:

PersonId    Address         Title   Forename    Surname FlagOne FlagTwo FlagThree
2           6 Smith Street  Ms      Jane        Doe     1       1       0

I know how to do an update based on a join, but I'm not sure how to implement the condiationals needed to get this particular result?

O. Jones
  • 103,626
  • 17
  • 118
  • 172
Bob Tway
  • 9,301
  • 17
  • 80
  • 162
  • 1
    `I can identify the records easily enough,` .... How you know those are duplicated rows? `person_id` are differents – Juan Carlos Oropeza Feb 28 '18 at 15:30
  • @JuanCarlosOropeza In this instance, any record that has the same address, forename and surname is treated as a duplicate – Bob Tway Feb 28 '18 at 15:33
  • Can the deduped records be placed into a new copy of the table? Or is part of your requirement to UPDATE some rows and DELETE others from the existing table? I suspect the former will be easier. Related question: do you need to preserve existing `PersonId` values? – O. Jones Feb 28 '18 at 15:35
  • Do you want to always keep the higher PersonID like your example ? – Daniel E. Feb 28 '18 at 15:36
  • @DanielE. No, it doesn't matter which record gets kept. – Bob Tway Feb 28 '18 at 15:41
  • @O.Jones No, we need to update and delete. This table is colossal - hundreds of millions of records, so I'll have to do it incrementally but that's a different problem. We do need to preserve existing PersonId values, yes, but it doesn't matter which one out of a dupe set gets kept. – Bob Tway Feb 28 '18 at 15:43

2 Answers2

3

You said you know how update with a JOIN, so something like this give the merge:

 SELECT MAX(PersonId),
        Address,
        MAX(Title), 
        Forename,
        Surname
        MAX( FlagOne ),
        MAX( FlagTwo ), 
        MAX( FlagThree)
FROM People
GROUP BY Address,
         Forename,
         Surname

Then you need to delete the duplicates

DELETE People
WHERE PersonID IN (SELECT MIN(PersonId)
                   FROM People
                   GROUP BY Address,
                            Forename,
                            Surname
                   HAVING COUNT(*) > 1)

this assume only duplicates, if you have three rows with same Forename, Surname. Will need a different aproach.

Juan Carlos Oropeza
  • 47,252
  • 12
  • 78
  • 118
  • Thanks for this - silly of me not think of using MAX. FYI there are a few instances where this issue spans 3+ rows, but they can be left in situ. The issue doesn't actually impact functionality, just storage space. – Bob Tway Feb 28 '18 at 15:47
1

I think you need to do it in 2 steps :

1- Update de the values :

Update People p
LEFT JOIN (
      SELECT MAX(PersonId) as PId,
            Address,
            MAX(Title) as title, 
            Forename,
            Surname
            MAX( FlagOne ) as FlagOne,
            MAX( FlagTwo ) as FlagTwo, 
            MAX( FlagThree) as Flagthree
    FROM People
    GROUP BY Address,
             Forename,
             Surname ) t

 ON t.Address = p.Address
AND t.Forename = p.Forename
AND t.Surname = p.Surname
SET p.FlagOne = t.FlagOne , 
    p.FlagTwo = t.FlagTwo , 
    p.FlagThree= t.FlagThree
WHERE p.PersonId = t.PId

2- Delete :

DELETE p
 FROM People p
 INNER JOIN 
    People t ON t.Address = p.Address
                  AND t.Forename = p.Forename
                  AND t.Surname = p.Surname
 WHERE p.PersonId < t.PersonId 
Daniel E.
  • 2,440
  • 1
  • 14
  • 24