0

On a table with about 1.7M rows, I tried to delete duplicates posts:

delete a FROM comment a
  INNER JOIN comment a2
     WHERE a.id < a2.id
     AND   a.body = a2.body;

The result was:

  Query OK, 35071 rows affected (5 hours 36 min 48.79 sec)

This happened on my almost idle workstation with Intel(R) Core(TM) i7-4790 CPU @ 3.60GHz. I'm wondering if there are some tricks to make this delete operation faster?

Milkyway
  • 705
  • 3
  • 12
  • 25
  • Its not good idea to delete them all in one go, try incremental delete with a filter clause ex: using a date column in `filer` – Ven Jun 07 '19 at 09:57
  • how many rows you have in your table comment ?? .. you leave on db only the last comment for duplicate by name? – ScaisEdge Jun 07 '19 at 09:57
  • @Ven, please elaborate with a sample query. – Milkyway Jun 07 '19 at 10:04
  • @scaisEdge, as I said, there are roughly 1.7 million rows. – Milkyway Jun 07 '19 at 10:05
  • 1
    @Milkyway if you have a column a table that can be used for deleting records, i have solved this issue in production this way. As this will not cause issues on log file growth ex: `delete a from comment a inner join comment a2 on a.body = a2.body and a.datecolumn between '2018-01-06' and '2019-01-06'` – Ven Jun 07 '19 at 10:09

3 Answers3

0

For MySQL specifically you can try (assuming rows have the exact same columns information):

ALTER IGNORE TABLE comment ADD UNIQUE INDEX idx_name (id, body);

Source

iamdlm
  • 1,885
  • 1
  • 11
  • 21
  • Not sure how to implement that in my particular use case. – Milkyway Jun 07 '19 at 10:07
  • This does not work for mysql. I get `#1327 - Undeclared variable: NewTable`, despite the fact that I've created it: `CREATE TABLE `NewTable` LIKE `comment`;` Also the rename command at mysql should be `RENAME TABLE `NewTable` TO `comment`;` – Milkyway Jun 07 '19 at 10:23
  • This answer has nothing to do with the question's requirement. – forpas Jun 07 '19 at 10:24
  • Whats Performance hit adding index on table with 1.7 million rows, and how long does it take to execute ? – Ven Jun 07 '19 at 10:28
0

The below query will be useful for you.

Delete  YourTableName 
From    (
Select  row_number() over(Partition by ColName1,ColName2,ColName3 order by ColName1,ColName2,ColName3 Asc)As RowNumber
        )YourTableName
Where   YourTableName.RowNumber>1

if it's working kindly mark as answer

forpas
  • 160,666
  • 10
  • 38
  • 76
0

Your query is attempting a zillion deletes for the same row. For instance, if you have this data:

body   id
  a     1
  a     2
  a     3
  a     4

Then your query attempts the following deletions:

 c.body   c.id  c2.id
  a         1      4
  a         1      3
  a         1      2
  a         2      4
  a         2      3
  a         3      4

You can see how this would result in lots of work for the database, as the number of ids on a given body increase.

You can fix this using group by instead:

delete c 
    from comment c join
         (select c2.body, max(c2.id) as max_id
          from comment c2
          group by c2.body
         ) c2
         on c2.body = c.body and c.id < c2.max_id;

In addition, you want an index on comment(body, id).

You might also find that an anti-join works better than the join you are attempting:

delete c 
    from comment c left join
         comment c2
         on c2.body = c.body and c2.id > c.id
    where c2.id is null;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786