2

I have a database table that was generated by importing several thousand text documents each very large. For some reason, some files were imported multiple times.

I am trying to remove duplicate rows by using following query:

ALTER IGNORE TABLE mytable ADD UNIQUE INDEX myindex (LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX);

but I was getting an error

1062 - Duplicate entry

Apparently, IGNORE has been deprecated.

How can I remove duplicates from my database?

I guess I have to do a DELETE with a JOIN but I can't figure out the code.

The table is InnoDB and currently has about 40,000,000 rows (there should be about 17,000,000). Each row has a primary key.

Considering the size, I am hesitant to temporally change the table to MyISAM.

Aatish Sai
  • 1,647
  • 1
  • 26
  • 41
tcam27
  • 95
  • 7
  • 1
    Do none of these options work for you? http://stackoverflow.com/questions/2630440/how-to-delete-duplicates-on-a-mysql-table – Tim Biegeleisen Jan 15 '17 at 04:39
  • Looks like I can combine a few of the options into a looped query. My hosting provider limits queries to about 30 seconds so some of the options alone wouldn't work. – tcam27 Jan 16 '17 at 21:40

3 Answers3

0

Each row has a primary key

Is a unique number?

Create an AUX table like this(assuming ID is the PK):

create table mytable_aux as (
select LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX, MIN(ID)
from mytable 
group by LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX);

Then delete everything that is not in aux table:

delete from mytable  where id not in (select aux.id from mytable_aux aux) ;
Mariano L
  • 1,809
  • 4
  • 29
  • 51
0

Assuming it is just one table and you have the SQL dump available...

  • CREATE the table with all the relationships established but no data inserted. Keep the INSERT statements stored in a separate .sql file.

  • Change all the INSERT statements to INSERT IGNORE.

  • Import the updated .sql file containing only the INSERT IGNORE statements. The duplicates will be automatically ignored.

Please note that, without comparing manually, you won't be able to figure out which or how many records were ignored.

However, if you're absolutely sure that you really don't need the duplicates based on the relationships defined on the table, then this approach works fairly well.


Also, if you'd like to do the same with multiple tables, you'll have to make sure that you CREATE all the tables at the start, define the foreign keys / dependencies AND, most importantly, arrange the new .sql file in such a manner that the table that has no dependency gets the INSERT statements loaded first. Likewise, the last set of INSERT statements will be for the table with the most number of dependencies.

Hope that helps.

Dhruv Saxena
  • 1,336
  • 2
  • 12
  • 29
0

If those are the only fields in your table you can always:

create table temp_unique as
select distinct LASTNAME, FIRSTNAME, HOUSENUMBER, STREET, CITY, ZIP, DOB, SEX
from mytable

then rename (or drop if you dare) mytable and rename temp_unique to mytable, then create your indexes (make sure to create any other indexes or FKs or whatever that already exist).

If you're working on a live table you'll have to delete the underlying records one at a time. That's quite a bit different -- add a uid then perform deletes. If that's your situation, let us know, we can refactor.

Chipmonkey
  • 863
  • 7
  • 18