10

Say I have a table customers with the following fields and records:

id   first_name   last_name   email                  phone
------------------------------------------------------------------------
1    Michael      Turley      mturley@whatever.com   555-123-4567
2    John         Dohe        jdoe@whatever.com      
3    Jack         Smith       jsmith@whatever.com    555-555-5555
4    Johnathan    Doe                                123-456-7890

There are several other tables, such as orders, rewards, receipts which have foreign keys customer_id relating to this table's customers.id.

As you can see, in their infinite wisdom, my users have created duplicate records for John Doe, complete with inconsistent spelling and missing data. An administrator notices this, selects customers 2 and 4, and clicks "Merge". They are then prompted to select which value is correct for each field, etc etc and my PHP determines that the merged record should look like this:

id   first_name   last_name   email                  phone
------------------------------------------------------------------------
?    John         Doe         jdoe@whatever.com      123-456-7890

Let's assume Mr. Doe has placed several orders, earned rewards, generated receipts.. but some of these have been associated with id 2, and some have been associated with id 4. The merged row needs to match all of the foreign keys in other tables that matched the original rows.

Here's where I'm not sure what to do. My instinct is to do this:

DELETE FROM customers WHERE id = 4;

UPDATE customers
SET first_name = 'John',
    last_name  = 'Doe',
    email      = 'jdoe@whatever.com',
    phone      = '123-456-7890'
WHERE id = 2;

UPDATE orders, rewards, receipts
SET customer_id = 2
WHERE customer_id = 4;

I think that would work, but if later on I add another table that has a customer_id foreign key, I have to remember to go back and add that table to the second UPDATE query in my merge function, or risk loss of integrity.

There has to be a better way to do this.

Mike Turley
  • 1,172
  • 1
  • 9
  • 26
  • You could potentially query from information_schema for a list of tables that have customer_id as a column and update those tables accordingly. Although this introduces other risks. – sreimer Mar 08 '11 at 19:08

4 Answers4

6

I got here form google this is my 2 cents:

SELECT `TABLE_NAME` 
FROM `information_schema`.`KEY_COLUMN_USAGE` 
WHERE REFERENCED_TABLE_SCHEMA='DATABASE'
  AND REFERENCED_TABLE_NAME='customers'
  AND REFERENCED_COLUMN_NAME='customer_id'

add the db for insurance (you'll never know when somebody copies the db).

Instead of looking for a column name, here we look at the foreign keys themselves

If you change the on delete restrictions to restrict nothing can be deleted before the children are deleted/migrated

marc_s
  • 732,580
  • 175
  • 1,330
  • 1,459
borrel
  • 911
  • 9
  • 17
2

The short answer is, no there isn't a better way (that I can think of).

It's a trade off. If you find there are a lot of these instances, it might be worthwhile to invest some time writing a more robust algorithm for checking existing customers prior to adding a new one (i.e. checking variations on first / last names, presenting them to whoever is adding the customer, asking them 2 or 3 times if they are REALLY sure they want to add this new customer, etc.). If there are not a lot of these instances, it might not be worth investing that time.

Short of that, your approach is the only way I can think of. I would actually delete both records, and create a new one with the merged data, resulting in a new customer id rather than re-using an old one, but that's just personal preference - functionally it's the same as your approach. You still have to remember to go back and modify your merge function to reflect new relationships on the customer.id field.

David
  • 2,550
  • 7
  • 27
  • 29
  • out of curiosity, why would you avoid reusing the old id? the way i see it, that just adds an extra step since you have to update two sets of foreign keys instead of just one. – Mike Turley Mar 08 '11 at 19:34
  • You're right - like I said, just personal preference. Conceptually, I see it as creating a new record from two old ones, so I'd like to be able to do a sanity check after a merge operation to verify there are no instances of either of the old ID's left anywhere. Functionally, no different than your approach. – David Mar 08 '11 at 20:03
2

At a minimum, to prevent any triggers on deletions causing some cascading effect, I would FIRST do

update SomeTable set CustomerID = CorrectValue where CustomerID = WrongValue

(do that across all tables)...

THEN Delete from Customers where CustomerID = WrongValue

As for duplicate data... Try to figure out which "Will Smith, Bill Smith, William Smith" if you are lacking certain information... Some could be completely legitimate different people.

DRapp
  • 47,638
  • 12
  • 72
  • 142
  • Good point, I'll switch my order around. As for duplicates, I'm not attempting to auto-detect duplicates, the merges will be performed manually.. so it's up to my administrator to determine whether the people are legitimately different. Merges will mainly be performed if a customer complains that their reward totals are wrong or something, and we realize the "missing" data is due to a duplicate customer. – Mike Turley Mar 08 '11 at 19:36
0

As an update to my comment:

use information_schema;
select table_name from columns where column_name = 'customer_id';

Then loop through the resulting tables and update accordingly.

Personally, I would use your instinctive solution, as this one may be dangerous if there are tables containing customer_id columns that need to be exempt.

sreimer
  • 4,913
  • 2
  • 33
  • 43
  • 1
    Hm. I guess that might be dangerous if there is another database on the same server with customer_id columns. Assuming I want to update customer_id anywhere it appears, Is there a way to limit this query to only tables from a specific database? Or is it dangerous for other reasons too? – Mike Turley Mar 08 '11 at 19:31
  • You can limit this query to a specific database as follows: select table_name from columns where column_name = 'customer_id' and table_schema = 'database_name' – David Mar 08 '11 at 19:59