3

We have a table business_users with a user_id and business_id and we have duplicates. How can I write a query that will delete all duplicates except for one?

MvG
  • 57,380
  • 22
  • 148
  • 276
Matt Elhotiby
  • 43,028
  • 85
  • 218
  • 321
  • 1
    Click through the related questions. I found a bunch of ideas to try a few weeks ago when I was looking for this problem. I mixed and matched a few to get the desired results. – MetalFrog Sep 18 '12 at 18:39
  • 2
    Do you have any primary key, or other unique constraint, on this table? Or are `user_id` and `business_id` the only columns, such that entire rows are duplicated? – ruakh Sep 18 '12 at 18:43
  • Looks like a duplicate of http://stackoverflow.com/questions/672702/how-to-delete-duplicates-in-mysql-table?rq=1 – cptScarlet Sep 18 '12 at 18:44
  • yes there is a `id` as the primary key – Matt Elhotiby Sep 18 '12 at 18:52

2 Answers2

10

Completely identical rows

If you want to avoid completely identical rows, as I understood your question at first, then you can select unique rows to a separate table and recreate the table data from that.

CREATE TEMPORARY TABLE tmp SELECT DISTINCT * FROM business_users;
DELETE FROM business_users;
INSERT INTO business_users SELECT * FROM tmp;
DROP TABLE tmp;

Be careful if there are any foreign key constraints referencing this table, though, as the temporary deletion of rows might lead to cascaded deletions elsewhere.

Introducing a unique constraint

If you only care about pairs of user_id and business_id, you probably want to avoid introducing duplicates in the future. You can move the existing data to a temporary table, add a constraint, and then move the table data back, ignoring duplicates.

CREATE TEMPORARY TABLE tmp SELECT * FROM business_users;
DELETE FROM business_users;
ALTER TABLE business_users ADD UNIQUE (user_id, business_id);
INSERT IGNORE INTO business_users SELECT * FROM tmp;
DROP TABLE tmp;

The above answer is based on this answer. The warning about foreign keys applies just as it did in the section above.

One-shot removal

If you only want to execute a single query, without modifying the table structure in any way, and you have a primary key id identifying each row, then you can try the following:

DELETE FROM business_users WHERE id NOT IN
    (SELECT MIN(id) FROM business_users GROUP BY user_id, business_id);

A similar idea was previously suggested by this answer.

If the above request fails, because you are not allowed to read and delete from a table in the same step, you can again use a temporary table:

CREATE TEMPORARY TABLE tmp
SELECT MIN(id) id FROM business_users GROUP BY user_id, business_id;
DELETE FROM business_users WHERE id NOT IN (SELECT id FROM tmp);
DROP TABLE tmp;

If you want to, you can still introduce a uniqueness constraint after cleaning the data in this fashion. To do so, execute the ALTER TABLE line from the previous section.

Community
  • 1
  • 1
MvG
  • 57,380
  • 22
  • 148
  • 276
  • I like the last one but i get You can't specify target table 'business_users' for update in FROM clause – Matt Elhotiby Sep 18 '12 at 19:13
  • @Trace, I added a version which should avoid that problem. – MvG Sep 18 '12 at 19:32
  • Just out of curiousity, for the one-shot removal, why does the first example have `SELECT MIN(id) FROM` and the second one have `SELECT MIN(id) id FROM` (the second has two 'id's)? – Pete May 25 '16 at 03:36
  • 2
    @Pete: The `MIN(id) id` in the second is an abbreviation of `MIN(id) AS id`: it specifies the name of the column, so that the column in the resulting table isn't literally named `MIN(id)` which would be quite confusing and hard to type. In the first query, the name of the column does not matter, since the subquery is just used as a set. – MvG May 25 '16 at 08:02
3

Since you have a primary key, you can use that to pick which rows to keep:

delete from business_users
where id not in (
    select id from (
        select min(id) as id -- Make a list of the primary keys to keep
        from business_users
        group by user_id, business_id -- Group by your duplicated row definition
    ) as a -- Derived table to force an implicit temp table
);

In this way, you won't need to create/drop temp tables and such (except the implicit one).

You might want to put a unique constraint on user_id, business_id so you don't have to worry about this again.

Community
  • 1
  • 1
Tim Lehner
  • 14,813
  • 4
  • 59
  • 76
  • looks great but i get this You can't specify target table 'business_users' for update in FROM clause – Matt Elhotiby Sep 18 '12 at 19:11
  • @Trace, sorry...I've updated to make a subquery work in mysql in this scenario. – Tim Lehner Sep 18 '12 at 19:26
  • Note: I've read the same suggestion about using a subquery, but it failed in my own test setup. Seems to be due to the fact that I created `business_users` as a temporary table as well, for testing. In that case, the error is phrased `Can't reopen table: 'business_users'` which amounts to pretty much the same problem (at least in my eyes), but cannot be avoided by introducing yet another subquery. – MvG Sep 18 '12 at 19:46
  • Interesting. Here is my [test sqlfiddle](http://sqlfiddle.com/#!2/29b20/1). Could you possibly give us a better definition of your existing schema that's throwing the error? Perhaps you will need to put the primary keys you wish to keep into a temp table. – Tim Lehner Sep 18 '12 at 19:51