MySQL delete duplicate rows in table

Question

What I have here is a table with around 600.000 rows. In the table I have an integer column "uid" and a varchar column "ex_id" which is the one that has duplicate entries (at least 2 times the same "ex_id"). The problem is, that I need the newer entry of the dublets.

So I figured I could join with another table (also around 600.000 rows), where I have a Datetime column and also the "uid" column. Is there a way to do this in SQL so that it doesn't take forever to run that command? I've tried different approaches with JOINS and GROUP BYs but either they didn't work or they ran forever and never got a working result.

TLDR: I want to delete older duplicates in a 600.000 rows table and have to join with another 600.000 rows table to find the older ones. Is this possible to do in SQL with feasible runtime or do I have to use PHP?

EDIT: A little more information on the table structure:

[
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'a',
    'type' : 'ALL',
    'possible_keys' : NULL,
    'key' : NULL,
    'key_len' : NULL,
    'ref' : NULL,
    'rows' : 569418,
    'Extra' : ''
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'x',
    'type' : 'eq_ref',
    'possible_keys' : 'PRIMARY',
    'key' : 'PRIMARY',
    'key_len' : '8',
    'ref' : '---.a.user_id',
    'rows' : 1,
    'Extra' : 'Using where'
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'b',
    'type' : 'ALL',
    'possible_keys' : NULL,
    'key' : NULL,
    'key_len' : NULL,
    'ref' : NULL,
    'rows' : 569418,
    'Extra' : 'Using where; Using join buffer'
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'y',
    'type' : 'eq_ref',
    'possible_keys' : 'PRIMARY',
    'key' : 'PRIMARY',
    'key_len' : '8',
    'ref' : '---.b.user_id',
    'rows' : 1,
    'Extra' : 'Using where'
}
]

An newer entry will always has the highest uid. So what is the need to join ? — unbesiegbar, Jun 13 '14 at 07:18
The problem is, that the higher-uid entry is not always the one I want to keep, hence the need to look for the latest Datetime in the other table! — plocks, Jun 13 '14 at 07:19
I am wondering if you could just select the newer records rather the deleting older records in a temporary table. Did you try this ? — unbesiegbar, Jun 13 '14 at 07:25
@Ajit No, I only have an auto-incremented Primary key in each row. — plocks, Jun 13 '14 at 07:42
Might be this link helpful to you http://stackoverflow.com/questions/9239762/find-duplicates-in-the-same-table-in-mysql — Sadikhasan, Jun 13 '14 at 08:02
@Sadikhasan I've already seen this link and no, it did not help me further. — plocks, Jun 13 '14 at 08:23
@Devan. please post your select query with joins. I think proper indexing will fix your issue. — unbesiegbar, Jun 13 '14 at 08:24
That's the subquery SELECT, but it takes much too long and doesn't finish. SELECT * FROM table_1 a INNER JOIN table_1 b ON a.ex_id = b.ex_id WHERE a.uid <> b.uid — plocks, Jun 13 '14 at 08:32
alter table t1 add index(uid) and alter table t2 add index(uid) execute these commands and run the query again. — unbesiegbar, Jun 13 '14 at 09:39

score 1 · Accepted Answer · answered Jun 13 '14 at 08:49

1

This should be possible with a join. However no idea what you table structures and indexes are.

As an example this should get you a list of the uids that are duplicates:-

SELECT b.uid
FROM table1 a
INNER JOIN table2 x
ON a.uid = x.uid
INNER JOIN table1 b
ON a.ex_id = b.ex_id
INNER JOIN table2 y
ON b.uid = y.uid
WHERE x.datetime > y.datetime

From that you should be able to do a delete:-

DELETE b
FROM table1 a
INNER JOIN table2 x
ON a.uid = x.uid
INNER JOIN table1 b
ON a.ex_id = b.ex_id
INNER JOIN table2 y
ON b.uid = y.uid
WHERE x.datetime > y.datetime

Note that the indexes available will make a massive difference to the performance of this query (or any other similar one).

answered Jun 13 '14 at 08:49

Kickstart

21,403
2
21
33

1

I appreciate your work, because it actually works :), but sadly it takes much too long to run on the amount of data that I have to process. – plocks Jun 13 '14 at 09:17
Can you post the table declares (and an explain of the above query). – Kickstart Jun 13 '14 at 09:19
table declares: id bigint(20); discriminator enum; uid bigint(20); ex_id varchar(255); flag tinyint(1); hash binary(20); i don't know what you mean by "(and an explain of the above query)" – plocks Jun 13 '14 at 09:34
I need to know the indexes. If you type `EXPLAIN` before the rest of your query (the SELECT one, not the DELETE) and try and execute it like that it will instead provide details of now the tables are joined, what indexes are used, etc. If you can do this and post the results it would be helpful. – Kickstart Jun 13 '14 at 09:43
I've updated my question with the information that you wanted. – plocks Jun 13 '14 at 09:48
Add an index on the ex_id column to your main table. – Kickstart Jun 13 '14 at 09:55
Since that takes longer than 10 minutes which is unacceptable, I assume it won't be able to run in feasible time. Or is there something else I could try? – plocks Jun 13 '14 at 10:08
Adding an index takes 10 minutes! Ouch. Shouldn't take that long. However without an index the query will be slow. You could do it with a script dealing with a record at a time which will take far longer over all but at least will not lock the database up while it runs. – Kickstart Jun 13 '14 at 10:15
Yeah I did it that way now. But thanks anyways for the impressive query! – plocks Jun 13 '14 at 18:53

MySQL delete duplicate rows in table

1 Answers1