
How would I delete all duplicate data from a MySQL Table?

For example, with the following data:

SELECT * FROM names;

+----+--------+
| id | name   |
+----+--------+
| 1  | google |
| 2  | yahoo  |
| 3  | msn    |
| 4  | google |
| 5  | google |
| 6  | yahoo  |
+----+--------+

I would use SELECT DISTINCT name FROM names; if it were a SELECT query.

How would I do this with DELETE to only remove duplicates and keep just one record of each?

DineshDB
Highway of Life

  • Duplicate of http://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql and http://stackoverflow.com/questions/2867530/how-to-remove-duplicate-entries-from-a-mysql-db (Ironically.) – dkarp Jan 13 '11 at 21:03
  • It's not an exact duplicate question, as this asks specifically for a DELETE command to perform the same action that adding a unique index with an ALTER command would, where MySQL automatically removes the duplicate rows. In this case, we're choosing how exactly we want to delete the duplicates. – Highway of Life Jan 08 '13 at 22:37
  • So a question about duplicates has duplicates? Hmm – Hamman Samuel Sep 21 '17 at 03:24

2 Answers


Editor warning: This solution is computationally inefficient and may bring down your connection for a large table.

NB - You need to do this first on a test copy of your table!

When I did it, I found that unless the query also excluded self-matches (e.g. AND n1.id <> n2.id; the id comparisons below already imply this), it deleted every row in the table.

  1. If you want to keep the row with the lowest id value:

    DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name
    
  2. If you want to keep the row with the highest id value:

    DELETE n1 FROM names n1, names n2 WHERE n1.id < n2.id AND n1.name = n2.name
    

I used this method in MySQL 5.1; I'm not sure about other versions.
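Before running either DELETE, you can preview exactly which rows it would remove by expressing the same self-join as a SELECT (a sketch against the example names table; per the NB above, still try everything on a test copy first):

    SELECT DISTINCT n1.id, n1.name
      FROM names n1, names n2
     WHERE n1.id > n2.id
       AND n1.name = n2.name;

Every row this returns has a duplicate with a lower id, so these are exactly the rows the first DELETE would remove (DISTINCT is needed only because a row with several duplicates would otherwise be listed once per match).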


Update: Since people Googling for how to remove duplicates end up here:
Although the OP's question is about DELETE, please be advised that using INSERT and DISTINCT is much faster. For a database with 8 million rows, the query below took 13 minutes, while the DELETE approach ran for more than 2 hours and still didn't complete.

INSERT INTO tempTableName(cellId,attributeId,entityRowId,value)
    SELECT DISTINCT cellId,attributeId,entityRowId,value
    FROM tableName;
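To actually replace the original table with the deduplicated copy, the whole sequence looks something like the sketch below. The table and column names are from the example above; tempTableName is assumed not to exist yet, and (as noted in the comments) you should double-check indexes and foreign key references before dropping the old table:

    -- Copy the structure, fill it with only distinct rows, then swap names.
    CREATE TABLE tempTableName LIKE tableName;

    INSERT INTO tempTableName (cellId, attributeId, entityRowId, value)
        SELECT DISTINCT cellId, attributeId, entityRowId, value
        FROM tableName;

    RENAME TABLE tableName     TO tableName_old,
                 tempTableName TO tableName;

    DROP TABLE tableName_old;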
ivan_pozdeev
martin.masa
  • Excellent solution. It worked perfectly. But I have one suggestion: we should swap the conditions. Instead of [WHERE n1.id > n2.id AND n1.name = n2.name] we should write [WHERE n1.name = n2.name AND n1.id > n2.id]; it will improve performance if we have a lot of data. – user307635 Aug 31 '12 at 13:42
  • FYI: This ignores rows where column "name" is null. – a coder Sep 12 '12 at 14:49
  • http://sqlfiddle.com/#!2/e5df9 for a schema example. May need to load in your instance of MySQL. Then execute: SELECT * FROM deldup; DELETE n1 FROM deldup n1, deldup n2 WHERE n1.or_id < n2.or_id AND n1.order_id = n2.order_id AND n1.txt_value = n2.txt_value AND n1.date_of_revision = n2.date_of_revision AND n1.status = n2.status; SELECT * FROM deldup; – a coder Sep 12 '12 at 14:58
  • @user307635 I think that the MySQL optimizer will handle the correct order? – Peterdk May 26 '13 at 22:02
  • When I try to replace `DELETE` with `SELECT` I get an `unknown column n1 in field list` error. Does this only work with `DELETE`? (MySQL 5.5) – Peterdk May 26 '13 at 22:14
  • @Peterdk DELETE takes a table name (required if you specify more than one table), SELECT takes column name(s), or try "SELECT * ...". – toddkaufmann Sep 19 '13 at 22:21
  • The NB in this answer is VERY IMPORTANT, kids. But this is an excellent use of MySQL. NOTE that for tables that could have duplicates repeated more than once you will also want a `GROUP BY` n1.id clause. – usumoio Oct 19 '13 at 00:00
  • I love this solution, but do you have a suggestion for optimizing it on larger tables? – Argus9 Nov 01 '13 at 17:34
  • So how does this work with large tables? I have 400k+ rows. It works on the sqlfiddle given above, though. – kapitanluffy Nov 15 '13 at 18:14
  • Be careful! MySQL's `=` sign is not case sensitive. This will delete `Test Name` as it will match `test name`. – Dean Rather Feb 04 '14 at 05:04
  • What does n1 represent? Is it the column? – mmr Mar 18 '14 at 08:42
  • @mmr n1 is the table alias (...FROM names n1) – martin.masa Mar 19 '14 at 08:41
  • Thank you! This works perfectly when `ALTER IGNORE TABLE table ADD UNIQUE INDEX index1(table.column_name)` doesn't work – Dylan Pierce Aug 08 '14 at 17:33
  • Somehow it didn't work; for me it deleted all values which had duplicates. The query below worked fine. – Srihari Karanth Aug 09 '14 at 09:02
  • Dude, with an "n.b." like that you should put it BEFORE everything else! – Stormsson Mar 05 '15 at 16:52
  • What could happen if `n1.id` and `n2.id` might also be duplicated? – CMCDragonkai Jul 06 '15 at 15:51
  • This took 171 seconds on a 10,000 record table with 450 duplicates. The answer by OMG Ponies took 4 seconds. – David Gleba Jul 06 '15 at 19:42
  • No doubt, it's the most elegant and excellent solution... saved my time :) – Zaffar Saffee Aug 24 '15 at 21:28
  • And I am running it every few days... thinking of writing a small script to do it via cron. Thanks a million for this straightforward solution; would that I had more than one upvote option. – Zaffar Saffee Mar 09 '16 at 18:54
  • This query doesn't work on MySQL version 5.5.47-0ubuntu0.14.04.1 – RN Kushwaha Apr 06 '16 at 10:32
  • On MySQL 5.7 (CentOS) this doesn't seem to work properly – vknyvz Dec 14 '16 at 22:08
  • The DISTINCT solution is great, but be careful. If you only had two columns, an ID and a name, you could condense the rows by names - but what if you had other fields with divergent values (which you could tolerate losing one or more duplicates of)? In this case `DISTINCT` won't work because of the extra field. In this case use `INSERT INTO tmp SELECT .. GROUP BY..`. _ALSO_, be careful with your rename sequence and your InnoDB foreign key references!! And make sure you don't lose any indexes. All good things to check. – Oliver Williams May 26 '17 at 03:38
  • Oh, and on my method above, use MIN(id) if you want to select the first, MAX(id) if you want to select the last. – Oliver Williams Jun 16 '17 at 17:12
  • In case anyone is interested, here (https://stackoverflow.com/questions/3311903/remove-duplicate-rows-in-mysql/47392593#47392593) I explain how to use a temporary table to delete MySQL duplicates in a reliable and fast way (with examples for different use cases). – César Revert-Gomar Nov 20 '17 at 16:39
  • Great! But is there any faster solution, for example for a table with 2 million records? – Muaaz Khalid Apr 21 '18 at 18:51
  • This should not be the selected answer. I have very short data in my table with 100,000 rows, and your query brought my server down with a MySQL "too many connections" error. I waited 10 minutes to see the results but it didn't finish. I ended up stopping the query and restarting my MySQL server to get my server back. – user198989 May 09 '18 at 22:38
  • @user198989 - agreed; OMG Ponies' answer is inherently faster on large tables. Especially true if there are multiple duplicates of each record. – ToolmakerSteve Mar 08 '19 at 14:56
  • This will not work if there are more than 2 duplicate rows. Please check again. – user3065757 Dec 01 '19 at 09:30
  • @martin.masa May I know how to do this with an UPDATE statement? Something like a soft delete: set a date in a deleted_at column. – scarlet pro Mar 24 '20 at 12:57
  • The INSERT INTO with SELECT DISTINCT method worked perfectly and, as has been said here, quickly. For whatever reason this was the only method I tried in a long annoying day that did what was promised. – Byterbit May 05 '20 at 23:48

If you want to keep the row with the lowest id value:

DELETE FROM NAMES
 WHERE id NOT IN (SELECT * 
                    FROM (SELECT MIN(n.id)
                            FROM NAMES n
                        GROUP BY n.name) x)

If you want the id value that is the highest:

DELETE FROM NAMES
 WHERE id NOT IN (SELECT * 
                    FROM (SELECT MAX(n.id)
                            FROM NAMES n
                        GROUP BY n.name) x)

The subquery in a subquery is necessary for MySQL, or you'll get a 1093 error.
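For reference, this is the single-level version that MySQL rejects, because it won't modify a table while also selecting from that same table in a direct subquery (a sketch against the example NAMES table):

    -- Fails with ERROR 1093 (HY000): You can't specify target table 'NAMES'
    -- for update in FROM clause
    DELETE FROM NAMES
     WHERE id NOT IN (SELECT MIN(n.id)
                        FROM NAMES n
                    GROUP BY n.name);

Wrapping the grouped subquery in one more SELECT forces MySQL to materialize it as a derived table, which is why the double nesting above works.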

OMG Ponies
  • What does the 'x' do? – GDmac Feb 15 '14 at 10:29
  • @GDmac it serves as an alias for the inner query. If not specified, an error will be thrown. – rpozarickij Mar 13 '14 at 10:08
  • This is a much better solution. The one by masa-255 doesn't work. What is the x for? – wbinky Jun 20 '14 at 11:15
  • @wbinky it serves as an alias for the inner query. If not specified, an error will be thrown. – mraaroncruz Sep 15 '14 at 10:33
  • But what's the x for? (just kidding) – tumultous_rooster May 25 '15 at 05:03
  • It seems this SQL deletes unique rows too. Actually, all rows. – mst Jul 14 '15 at 11:28
  • What if I have the same id? – Asmita Nov 17 '15 at 06:01
  • This is a better solution than masa-255's one if you have big data. Thanks. – namal Dec 04 '15 at 13:59
  • Actually it's just like using a temporary table. – onebraveman Apr 14 '16 at 13:40
  • @GDmac: An alias is required for every derived table. x here serves as an alias for the table derived from the inner query. – Suraj Kumar Sep 15 '16 at 08:23
  • @Asmita you can use `DELETE FROM NAMES WHERE id NOT IN (SELECT * FROM (SELECT MAX(n.id) FROM NAMES n GROUP BY n.id) x)` – Mostafa Dec 03 '16 at 06:12
  • It takes very long to execute. – MisterPi Dec 06 '16 at 06:34
  • `x` could be written as `AS x` or `AS temp` or `AS \`temp\`` to make it clearer that it's an arbitrary alias name for the table generated by the inner select. MySQL won't let you use the same table for both an action (DELETE, UPDATE) and a condition, so wrapping the condition in another SELECT allows a temp table to be created and given an alias. – Slam Oct 08 '17 at 16:40
  • This took only 8 seconds on a table with 500K records. Thanks. – vthallam Jan 18 '18 at 16:11
  • This was super fast compared to the accepted answer! – xssChauhan May 09 '18 at 07:17
  • This should be the accepted answer as it is much, much faster. – Tainmar Sep 05 '19 at 10:56
  • Super fast, but this also deleted unique rows. It cost me 20K+ unique rows deleted with this query. – Asfandyar Khan Nov 18 '19 at 11:28
  • @AsfandyarKhan - To avoid deleting rows unintentionally, the query *must* return a row for *every* existing value of the `group by` field(s). That is, this approach *must not* be used with a more complicated inner query. For example, if one alters the query to only select rows from a given id range or date range (or other limited part of the table), then the `DELETE` will delete all data not in the specified range. In more complicated situations, it is best to make a temporary table containing IDs to be deleted. Then if a row shouldn't be in that temp table, a 2nd query can correct it. – ToolmakerSteve Feb 24 '20 at 00:10
  • This solution gives me the error 'SQL Error [1093] [HY000]: You can't specify target table 'NAMES' for update in FROM clause'. – Adiyya Tadikamalla Mar 21 '23 at 11:37
  • Why did you use two selects? I removed the outer one (SELECT * FROM) and it worked just fine. – Vahidrk Apr 30 '23 at 07:42