Remove Dupes with checks on secondary fields first

Question

I have a table with a field (Name) I'd like to create a unique index on, however it seems there are existing duplicates. I dont' want to just get rid of dupes since some might have information in other fields that I need. Essentially I have:

ID
ParentID
Name
Code
RelatedID

So Goal 1 is I want to keep the record that has values in the secondary fields other then ID and Name. In most cases this will be one of the dupes only.

Goal 2 is in case two identical Names both have values but in different fields I want to 'merge' those since it is remotely possible one duplicate will have values in one key field and one in the other.

Finally Goal 3 is in the case that two names both have values in a key field I'd probably want to manually review those first.

It seems to me my first step as I read this would be Goal 3; manually review duplicates where Name Field is identical, and more then one record has a non-Null/non-empty value in a key field.

Once I address this the goal would be to 'mere' the remaining records i.e keep one record with Name and any non-null/non-empty key fields from the others.

Any thoughts much appreciated.

score 0 · Answer 1 · edited May 23 '17 at 12:28

Sounds like a solid plan - hope you have a development environment you can dry run it in.

Here is some code that may help you along

Starting with Step 3. This statement should help you find which records need to be reviewed.

SELECT * 
FROM (
  SELECT name, 
    GROUP_CONCAT(DISTINCT parentID) AS parentID,
    GROUP_CONCAT(DISTINCT code) AS code,
    GROUP_CONCAT(DISTINCT RelatedID) AS RelatedID,
  FROM foo
  GROUP BY name
  HAVING COUNT(*)>1) as summarized
WHERE parentID LIKE '%,%'
  OR code LIKE '%,%'
  OR RelatedID LIKE '%,%';

Anything that comes up in that query you will probably have to manually fix after figuring out why there are multiple values for the same field.

Once those fixes are in place, it's times for the merge. I would create a holding / temporary table with the correct values. MAX should take care of the logic to choose non-null values

CREATE TABLE foo_values
SELECT name, MAX(parentID) as parentID, MAX(code) AS code, MAX(RelatedID) AS RelatedID.
FROM foo
GROUP BY name
HAVING COUNT(*)>1;

In theory, now you have the merged values. You can remove the duplicate name rows using whatever technique you are most comfortable with(See here) while adding your unique index. Finally, update the secondary fields by JOINing back to foo values.

Remove Dupes with checks on secondary fields first

1 Answers1