1

What I have here is a table with around 600.000 rows. In the table I have an integer column "uid" and a varchar column "ex_id" which is the one that has duplicate entries (at least 2 times the same "ex_id"). The problem is, that I need the newer entry of the dublets.

So I figured I could join with another table (also around 600.000 rows), where I have a Datetime column and also the "uid" column. Is there a way to do this in SQL so that it doesn't take forever to run that command? I've tried different approaches with JOINS and GROUP BYs but either they didn't work or they ran forever and never got a working result.

TLDR: I want to delete older duplicates in a 600.000 rows table and have to join with another 600.000 rows table to find the older ones. Is this possible to do in SQL with feasible runtime or do I have to use PHP?

EDIT: A little more information on the table structure:

[
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'a',
    'type' : 'ALL',
    'possible_keys' : NULL,
    'key' : NULL,
    'key_len' : NULL,
    'ref' : NULL,
    'rows' : 569418,
    'Extra' : ''
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'x',
    'type' : 'eq_ref',
    'possible_keys' : 'PRIMARY',
    'key' : 'PRIMARY',
    'key_len' : '8',
    'ref' : '---.a.user_id',
    'rows' : 1,
    'Extra' : 'Using where'
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'b',
    'type' : 'ALL',
    'possible_keys' : NULL,
    'key' : NULL,
    'key_len' : NULL,
    'ref' : NULL,
    'rows' : 569418,
    'Extra' : 'Using where; Using join buffer'
},
{
    'id' : 1,
    'select_type' : 'SIMPLE',
    'table' : 'y',
    'type' : 'eq_ref',
    'possible_keys' : 'PRIMARY',
    'key' : 'PRIMARY',
    'key_len' : '8',
    'ref' : '---.b.user_id',
    'rows' : 1,
    'Extra' : 'Using where'
}
]
plocks
  • 561
  • 8
  • 25

1 Answers1

1

This should be possible with a join. However no idea what you table structures and indexes are.

As an example this should get you a list of the uids that are duplicates:-

SELECT b.uid
FROM table1 a
INNER JOIN table2 x
ON a.uid = x.uid
INNER JOIN table1 b
ON a.ex_id = b.ex_id
INNER JOIN table2 y
ON b.uid = y.uid
WHERE x.datetime > y.datetime

From that you should be able to do a delete:-

DELETE b
FROM table1 a
INNER JOIN table2 x
ON a.uid = x.uid
INNER JOIN table1 b
ON a.ex_id = b.ex_id
INNER JOIN table2 y
ON b.uid = y.uid
WHERE x.datetime > y.datetime

Note that the indexes available will make a massive difference to the performance of this query (or any other similar one).

Kickstart
  • 21,403
  • 2
  • 21
  • 33
  • 1
    I appreciate your work, because it actually works :), but sadly it takes much too long to run on the amount of data that I have to process. – plocks Jun 13 '14 at 09:17
  • Can you post the table declares (and an explain of the above query). – Kickstart Jun 13 '14 at 09:19
  • table declares: id bigint(20); discriminator enum; uid bigint(20); ex_id varchar(255); flag tinyint(1); hash binary(20); i don't know what you mean by "(and an explain of the above query)" – plocks Jun 13 '14 at 09:34
  • I need to know the indexes. If you type `EXPLAIN` before the rest of your query (the SELECT one, not the DELETE) and try and execute it like that it will instead provide details of now the tables are joined, what indexes are used, etc. If you can do this and post the results it would be helpful. – Kickstart Jun 13 '14 at 09:43
  • I've updated my question with the information that you wanted. – plocks Jun 13 '14 at 09:48
  • Add an index on the ex_id column to your main table. – Kickstart Jun 13 '14 at 09:55
  • Since that takes longer than 10 minutes which is unacceptable, I assume it won't be able to run in feasible time. Or is there something else I could try? – plocks Jun 13 '14 at 10:08
  • Adding an index takes 10 minutes! Ouch. Shouldn't take that long. However without an index the query will be slow. You could do it with a script dealing with a record at a time which will take far longer over all but at least will not lock the database up while it runs. – Kickstart Jun 13 '14 at 10:15
  • Yeah I did it that way now. But thanks anyways for the impressive query! – plocks Jun 13 '14 at 18:53