13

What's the best way to delete duplicate records in a mysql database using rails or mysql queries?

nan
  • You need a key field to check for distinctness, so you know what to delete by – TStamper Mar 18 '09 at 20:17

15 Answers

10

What you can do is copy the distinct records into a new table by:

 select distinct * into NewTable from MyTable
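
Note that `select ... into` a new table is SQL Server syntax; MySQL does not support it. A minimal MySQL equivalent (a sketch; swapping the new table into place afterwards is still up to you):

 CREATE TABLE NewTable AS SELECT DISTINCT * FROM MyTable;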
TStamper
8

Here's another idea in no particular language:

rs = `select a, b, count(*) as c from entries group by 1, 2 having c > 1`
rs.each do |a, b, c|
  `delete from entries where a=#{a} and b=#{b} limit #{c - 1}`
end
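
For example (a sketch with made-up values), in MySQL the two backticked queries would expand to:

 SELECT a, b, COUNT(*) AS c FROM entries GROUP BY 1, 2 HAVING c > 1;
 -- for a group that came back as a='x', b='y', c=3, delete all but one row:
 DELETE FROM entries WHERE a = 'x' AND b = 'y' LIMIT 2;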

Edit:

Kudos to Olaf for that "having" hint :)

krukid
7

Well, if it's a small table, from the Rails console you can do:

class ActiveRecord::Base
  # every attribute except the primary key, so identical rows compare equal
  def non_id_attributes
    atts = self.attributes
    atts.delete('id')
    atts
  end
end

# group records by their non-id attributes, keeping only groups with duplicates
duplicate_groups = YourClass.find(:all).group_by { |element| element.non_id_attributes }.select { |gr| gr.last.size > 1 }
# in each group, everything but the first record is redundant
redundant_elements = duplicate_groups.map { |group| group.last - [group.last.first] }.flatten
redundant_elements.each(&:destroy)
Maximiliano Guzman
6

Check for duplicate entries:

SELECT DISTINCT(req_field) AS field, COUNT(req_field) AS fieldCount FROM 
table_name GROUP BY req_field HAVING fieldCount > 1


Remove the duplicates:

DELETE FROM table_name 
USING table_name, table_name AS vtable 
WHERE 
    (table_name.id > vtable.id) 
AND (table_name.req_field = vtable.req_field)

Replace req_field and table_name with your own names; it should work without any issues.
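
For example, with a hypothetical users table de-duplicated on an email column (both names are illustrative, not from the original answer):

DELETE FROM users 
USING users, users AS vtable 
WHERE 
    (users.id > vtable.id) 
AND (users.email = vtable.email)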

DMin
  • In postgres, this should be `DELETE FROM table_name USING table_name AS vtable WHERE (table_name.id > vtable.id) AND (table_name.req_field=req_field)` – Ultrasaurus May 13 '12 at 16:08
4

New to SQL :-) This is a classic question, often asked in interviews :-) I don't know whether it'll work in MySQL, but it works in most databases:

> create table t(
>     a char(2),
>     b char(2),
>     c smallint )

> select a,b,c,count(*) from t
> group by a,b,c
> having count(*) > 1
a  b  c      count(*)
-- -- ------ --------
(0 rows affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bb",1)
(1 row affected)

> insert into t values ("aa","bc",1)
(1 row affected)

> select a,b,c,count(*) from t group by a,b,c having count(*) > 1
a  b  c      count(*)
-- -- ------ --------
aa bb      1        2
(1 row affected)
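
To actually delete the duplicates this select finds, one hedged sketch (it assumes you first give t a surrogate key, which the demo table above lacks) is a self-join delete:

 ALTER TABLE t ADD id INT AUTO_INCREMENT PRIMARY KEY;
 DELETE t1 FROM t AS t1
 JOIN t AS t2
   ON t1.a = t2.a AND t1.b = t2.b AND t1.c = t2.c
  AND t1.id > t2.id;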
Olaf Kock
  • The formatting is lost - but I'm pretty sure that any programmer will be able to figure it out. –  Mar 18 '09 at 20:55
  • I've corrected the formatting. Even though you answered how to 'select' the records where nan wanted to 'delete' them, it's much more readable now, and the answer relates to the underlying problem anyway. – Olaf Kock Nov 17 '09 at 20:53
  • Thank you very much this query is awesomely working :D > select a,b,c,count(*) from t group by a,b,c having count(*) > 1 – Romans 8.38-39 Jul 23 '14 at 05:53
1

If you have a PK (id) on the table (EMP) and want to delete duplicate records by the name column, keeping the oldest row, the following query may be a good approach for large data sets.

DELETE t3
FROM (
        SELECT t1.name, t1.id
        FROM (
                SELECT name
                FROM EMP
                GROUP BY name
                HAVING COUNT(name) > 1
        ) AS t0 INNER JOIN EMP t1 ON t0.name = t1.name
) AS t2 INNER JOIN EMP t3 ON t3.name = t2.name
WHERE t2.id < t3.id;
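
A hedged aside (not part of the original answer): since this joins EMP to itself on name twice, an index on that column should help on large tables:

ALTER TABLE EMP ADD INDEX idx_emp_name (name);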
LPL
1

Suppose we have a table named tbl_product with duplicates in the fields p_pi_code and p_nats_id. First create a new table and insert the de-duplicated data from the existing table, i.e. from tbl_product into newtable1; then repeat for the second field, from newtable1 into newtable2:

CREATE TABLE `newtable2` (
    `p_id` int(10) unsigned NOT NULL auto_increment,
    `p_status` varchar(45) NOT NULL,
    `p_pi_code` varchar(45) NOT NULL,
    `p_nats_id` mediumint(8) unsigned NOT NULL,
    `p_is_special` tinyint(4) NOT NULL,
    PRIMARY KEY (`p_id`)
) ENGINE=InnoDB;

-- note: selecting non-grouped columns relies on MySQL's loose GROUP BY;
-- which row of each duplicate group survives is undefined
INSERT INTO newtable1 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT 
    p_status, p_pi_code, p_nats_id, p_is_special FROM tbl_product GROUP BY p_pi_code;

INSERT INTO newtable2 (p_status, p_pi_code, p_nats_id, p_is_special) SELECT 
    p_status, p_pi_code, p_nats_id, p_is_special FROM newtable1 GROUP BY p_nats_id;

After that, all duplicates in both fields are removed.
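
The answer stops short of putting the clean table into service; a hedged final step (the _old suffix is made up) would be an atomic swap:

RENAME TABLE tbl_product TO tbl_product_old, newtable2 TO tbl_product;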

LPL
0

I used @krukid's answer above to do the following on a table with around 70,000 entries:

rs = 'select a, b, count(*) as c from table_name group by 1, 2 having c > 1'

# get the duplicate groups (select_all returns an array of string-keyed hashes)
dups = MyModel.connection.select_all(rs)

# convert to an array of [a, b, count] triples
dupsarr = dups.map { |i| [i['a'], i['b'], i['c']] }

# delete all but one row from each duplicate group
dupsarr.each do |a, b, c|
  ActiveRecord::Base.connection.execute("delete from table_name where a=#{MyModel.sanitize(a)} and b=#{MyModel.sanitize(b)} limit #{c.to_i - 1}")
end
Abdo
0

Here is the Rails solution I came up with. It may not be the most efficient, but that's not a big deal if it's a one-time migration.

# one representative id per group of unique column values
distinct_records = MyTable.all.group(:distinct_column_1, :distinct_column_2).map { |mt| mt.id }
# every record that is not a group representative is a duplicate
# (reject, not reject!, since reject! returns nil when nothing is rejected)
duplicates = MyTable.all.to_a.reject { |mt| distinct_records.include?(mt.id) }
duplicates.each(&:destroy)

First, it groups by all the columns that determine uniqueness; the example shows two, but you could have more or fewer.

Second, it selects the inverse of that group: all the other records.

Third, it destroys all of those records.

Matthew Pautzke
0

You could first GROUP BY the column on which you want to de-duplicate, but I am not doing it with GROUP BY here; I am writing a self join instead.

You don't need to create a temporary table.

Delete all duplicates except one record: the table needs an auto-increment column for this. A possible solution that I've just come across:

DELETE n1 FROM names n1, names n2 WHERE n1.id > n2.id AND n1.name = n2.name

if you want to keep the row with the lowest auto increment id value OR

DELETE n1 FROM names n1, names n2 WHERE n1.id < n2.id AND n1.name = n2.name

if you want to keep the row with the highest auto increment id value.

You can cross-check your solution by searching for duplicates again:

SELECT * FROM `names` GROUP BY name HAVING COUNT(name) > 1;

If it returns 0 rows, your query was successful.

Aman Garg
0

If your table has a PK (or you can easily give it one), you can specify any number of columns in the table that must all be equal (to qualify a row as a duplicate) with the following query (it may look a bit messy, but it works):

DELETE FROM table WHERE pk_id IN(
   SELECT DISTINCT t3.pk_id FROM (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t3, (
       SELECT t1.* FROM table AS t1 INNER JOIN (
           SELECT col1, col2, col3, col4, COUNT(*) FROM table
           GROUP BY col1, col2, col3, col4 HAVING COUNT(*)>1) AS t2
       ON t1.col1 = t2.col1 AND t1.col2 = t2.col2 AND t1.col3 = t2.col3 AND
       t1.col4 = t2.col4)
   AS t4
   WHERE t3.col1 = t4.col1 AND t3.col2 = t4.col2 AND t3.col3 = t4.col3 AND
   t3.col4 = t4.col4 AND t3.pk_id > t4.pk_id

)

This will leave the first record entered into the database, deleting the 'newest' duplicates. If you want to keep the last record, switch the > to <.

rayman86
0

In MySQL, when I put something like

delete from A where IDA in (select IDA from A )

MySQL said something like "you can't use the same table in the select part of the delete operation."

I just had to delete some duplicate records, and I succeeded with a .php program like this:

<?php
...
// hacer_sql is the author's own query helper
$res = hacer_sql("SELECT MIN(IDESTUDIANTE) as IDTODELETE 
FROM `estudiante` group by `LASTNAME`,`FIRSTNAME`,`CI`,`PHONE`
HAVING COUNT(*) > 1");
while ( $reg = mysql_fetch_assoc($res) ) {
   hacer_sql("delete from estudiante where IDESTUDIANTE = {$reg['IDTODELETE']}");
}
?>
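
For what it's worth, the single-statement version can be made to work in MySQL by wrapping the subquery in a derived table, which sidesteps that error (a sketch with the same tables; like the loop above, it removes the lowest id from each duplicate group per run):

DELETE FROM estudiante WHERE IDESTUDIANTE IN (
  SELECT IDTODELETE FROM (
    SELECT MIN(IDESTUDIANTE) AS IDTODELETE
    FROM `estudiante`
    GROUP BY `LASTNAME`, `FIRSTNAME`, `CI`, `PHONE`
    HAVING COUNT(*) > 1
  ) AS dups
);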
user290149
0

I am using ALTER TABLE:

ALTER IGNORE TABLE jos_city ADD UNIQUE INDEX(`city`);
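
Note that ALTER IGNORE only dropped duplicate rows on some older MySQL versions and engines, and the IGNORE clause was removed entirely in MySQL 5.7. A hedged alternative with the same intent (the _dedup name is made up):

CREATE TABLE jos_city_dedup LIKE jos_city;
ALTER TABLE jos_city_dedup ADD UNIQUE INDEX (`city`);
INSERT IGNORE INTO jos_city_dedup SELECT * FROM jos_city;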
user474440
  • That does not seem to work, gets me `ERROR 1062 (23000): Duplicate entry '14190-2089' for key 'vid'` – mikl Dec 06 '10 at 14:04
  • This will definitely not remove existing duplicates, only prevent new ones from being inserted – Mithfindel Feb 14 '14 at 16:23
0

I had to do this recently on Oracle, but the steps would have been the same on MySQL. It was a lot of data, at least compared to what I'm used to working with, so my process to de-dup was comparatively heavyweight. I'm including it here in case someone else comes along with a similar problem.

My duplicate records had different IDs, different updated_at times, possibly different updated_by IDs, but all other columns the same. I wanted to keep the most recently updated of any duplicate set.

I used a combination of Rails logic and SQL to get it done.

Step one: run a rake script to identify the IDs of the duplicate records, using model logic. IDs go in a text file.

Step two: create a temporary table with one column, the IDs to delete, loaded from the text file.
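
A sketch of what step two might look like on the MySQL side (the file path is hypothetical):

CREATE TABLE temp_duplicate_ids (id INT PRIMARY KEY);
LOAD DATA LOCAL INFILE '/tmp/duplicate_ids.txt' INTO TABLE temp_duplicate_ids;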

Step three: create another temporary table with all the records I'm going to delete (just in case!).

CREATE TABLE temp_duplicate_models 
  AS (SELECT * FROM models 
  WHERE id IN (SELECT * FROM temp_duplicate_ids));

Step four: actual deleting.

DELETE FROM models WHERE id IN (SELECT * FROM temp_duplicate_ids);
Sarah Mei
0

You can use:

http://lenniedevilliers.blogspot.com/2008/10/weekly-code-find-duplicates-in-sql.html

to get the duplicates, and then just delete them via Ruby code or SQL code (I would do it in SQL, but that's up to you :-)

Lennie