2

I have a MySQL database with 1000s of personnel records, often with duplicates.

For each case with at least one duplicate I want to be able to delete all of the duplicates but one, then update any references to those deleted foreign keys with the one I did not.

For example, we see two instances of Star Lord below:

+-----------------------+
|        `users`        |
+------+----------------+
| id   | name           |
+------+----------------+
| 1    | Star Lord      |
+------+----------------+
| 2    | Star Lord      |
+------+----------------+
| 3    | Iron Man       |
+------+-----+----------+

+-----------------------+
|       `messages`      |
+------+-----+----------+
| from | to  | text     |
+------+-----+----------+
| 1    | 5   | hi       |
+------+-----+----------+
| 2    | 5   | how r u  |
+------+-----+----------+
| 5    | 2   | Good, u? |
+------+-----+----------+

Those two tables should become:

+-----------------------+
|        `users`        |
+------+----------------+
| id   | name           |
+------+----------------+
| 1    | Star Lord      |
+------+----------------+
| 3    | Iron Man       |
+------+-----+----------+

+-----------------------+
|       `messages`      |
+------+-----+----------+
| from | to  | text     |
+------+-----+----------+
| 1    | 5   | hi       |
+------+-----+----------+
| 1    | 5   | how r u  |
+------+-----+----------+
| 5    | 1   | Good, u? |
+------+-----+----------+

Can this be done? I'm happy to use PHP as needed.

I found the following, but it's only for finding foreign key usage, not replacing instances for specific key values: MySQL: How to I find all tables that have foreign keys that reference particular table.column AND have values for those foreign keys?

Bonus Points

There may be additional data which needs to be merged in the users table. For example, Star Lord with ID #1 might have a phone field filled in, but Star Lord with ID #2 has an email field.

Worst case: they both have a field, with conflicting data.

Community
  • 1
  • 1
Bing
  • 3,071
  • 6
  • 42
  • 81
  • Unless you are going for human intervention, bonus points is near unsolvable. You could do the nulls, but unless you can come up with a rule for each conflict, it's no can do. – Tony Hopkinson Apr 05 '15 at 08:38

2 Answers2

3

I suggest:

  1. Create a table of correct data. A good starting point might be:

    CREATE TABLE users_new LIKE users;
    ALTER  TABLE users_new ADD UNIQUE (name);
    
    INSERT INTO users_new
      (id, name, phone, email)
    SELECT   MIN(id), name, GROUP_CONCAT(phone), GROUP_CONCAT(email)
    FROM     users
    GROUP BY name;
    

    Note that, due to your "worst case" observation under "Bonus Points", you may well want to manually verify the contents of this table before archiving the underlying users data (I advise against permanent deletion, just in case).

  2. Update existing foreign relationships:

    UPDATE messages
      JOIN (users uf JOIN users_new unf USING (name)) ON uf.id = messages.from
      JOIN (users ut JOIN users_new unt USING (name)) ON ut.id = messages.to
    SET    messages.from = unf.id,
           messages.to   = unt.id
    

    If you have a lot of tables to update, you could cache the results of the join between users and users_new—either:

    • in a new_id column within the old users table:

      ALTER TABLE users ADD new_id BIGINT UNSIGNED;
      
      UPDATE users JOIN users_new USING (name)
      SET    users.new_id = users_new.id;
      
      UPDATE messages
        JOIN users uf ON uf.id = messages.from
        JOIN users ut ON ut.id = messages.to
      SET    messages.from = uf.new_id,
             messages.to   = ut.new_id;
      
    • or else in a new (temporary) table:

      CREATE TEMPORARY TABLE newid_cache (
        PRIMARY KEY(old_id),
        KEY(old_id, new_id)
      ) ENGINE=MEMORY
      SELECT users.id AS old_id, users_new.id AS new_id
      FROM   users JOIN users_new USING (name);
      
      UPDATE messages
        JOIN newid_cache nf ON nf.old_id = messages.from
        JOIN newid_cache nt ON nt.old_id = messages.to
      SET    messages.from = nf.new_id,
             messages.to   = nt.new_id;
      
  3. Either replace users with users_new, or else modify your application to use the new table in place of the old one.

    ALTER TABLE users     RENAME TO users_old;
    ALTER TABLE users_new RENAME TO users;
    
  4. Update any foreign key constraints as appropriate.

eggyal
  • 122,705
  • 18
  • 212
  • 237
  • @TonyHopkinson: But the original table is never modified (except, optionally, for adding a `new_id` column; and the final renaming step)—the underlying data is never touched. – eggyal Apr 05 '15 at 08:43
  • And then they delete users_old, a day later someone rings up and says something like 'all my messages' have gone. Then you go into full disaster mode. You are good enough to do it a oner, Careless, incompetents like myself, need a backstop. :( – Tony Hopkinson Apr 05 '15 at 08:50
  • @TonyHopkinson: My answer already says "*...verify the contents of this table before archiving the underlying `users` data (I advise against permanent deletion, just in case).*" – eggyal Apr 05 '15 at 08:51
  • Hmm attitude difference, I'd advise taking a backup, just in case, just in case. No big thing, its just a reflex on my part having been called in to undo a what turned out to be invalid data fix. In fact I'd probably do a back up after as well. – Tony Hopkinson Apr 05 '15 at 09:00
0

I like to be really methodical about this, while you could write it all in one complex query, that's an optimisation, and unless it's obvious, an unnecessary one.

First backup your database :) Create a table to hold the ids of the users you are going to keep.

Fill it with say

Insert into Keepers Select keep_id From (Select Min(id) as keep_id,`name` From `users`)

After that it's just some update with joins.

e.g.

UPDATE 
   `messages` m JOIN
   keepers k 
      ON k.keeper_id = m.from 
SET m.from = k.keeper_id

UPDATE 
   `messages` m JOIN
   keepers k 
      ON k.keeper_id = m.to 
SET m.to = k.keeper_id

Then get rid of the users you don't want

Delete `users`
from `users` u
outer join keepers on k.keeper_id = u.id
where i.id is null

When all is good e.g you have the same number of messages as you started with, no one is talking to themselves etc.

Delete the keepers table.

Syntax not checked, but it should be close.

Tony Hopkinson
  • 20,172
  • 3
  • 31
  • 39