0
select * from table1 where ID in (
    select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)

Above query ran in 2.2 seconds returning four result. Now when I change the select * to delete, it hangs up indefinitely.

delete from table1 where ID in (
    select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)

If I move the position of group by clause inside the alias select query, it will no longer hang.

delete from table1 where ID in (
    select a.ID from (select min(ID) from table1 group by id_x, id_y, col_z having count(*) > 1) a)

Why does it hang? Even though (select * from table1) pulls millions of records, the query doesn't seem to stop executing for hours. Can anybody explain what huddles the query? It puzzles me because the select query works fine whereas the delete query hangs.

EDIT: My focus here is why it hangs. I already posted work-around that works fine. But in order to develop prevention system, I need to get to the root cause of this..

Ryan
  • 130
  • 11
  • MySQL is very poor at optimizing `WHERE ID IN (SELECT ...)`. Use a `JOIN` instead. – Barmar May 09 '19 at 15:09
  • Did you ask this same question a few days ago? I'm certain I remember answering a question like this recently, but I can't find it now. Did you delete it after I answered? – Barmar May 09 '19 at 15:11
  • If you think it shouldn't have been closed as a duplicate, you should have requested that it be reopened, not post the same question again. – Barmar May 09 '19 at 15:17
  • I already did request for the other question to be reopened by editing twice. However, it was never brought to attention, it seems. And you haven't answered the other question. Also, the other question was not a duplicate – Ryan May 09 '19 at 15:21

2 Answers2

1

Use a JOIN instead of WHERE ID IN (SELECT ...).

DELETE t1
FROM table1 AS t1
JOIN (
    SELECT MIN(id) AS minId
    FROM table1
    GROUP BY id_x, id_y, col_z 
    HAVING COUNT(*) > 1) AS t2
ON t1.id = t2.minId

I think your query is not being optimized because it has to recalculate the subquery after each deletion, since deleting a row could change the MIN(id) for that group. Using a JOIN requires the grouping and aggregation to be done just once.

Barmar
  • 741,623
  • 53
  • 500
  • 612
  • Thanks for explaining how it needs to recalculate the `min` after each deletion. It worked fine after removing `min` evaluation. – Ryan May 09 '19 at 15:56
0

Try this:

delete t
    from table1 t join
         (select min(id) as min_id
          from table1
          group byid_x, id_y, col_z
          having count(*) >= 2
         ) tt
         on tt.min_id = t.id;

That said, you probably don't want to delete just the minimum id. I'm guessing you want to keep the most recent id. If so:

delete t
    from table1 t left join
         (select max(id) as max_id
          from table1
          group byid_x, id_y, col_z
          having count(*) >= 2
         ) tt
         on tt.max_id = t.id
    where tt.max_id is null;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786