0
select * from table1 where ID in (
    select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)

Above query ran in 2.2 seconds returning four result. Now when I change the select * to delete, it hangs up indefinitely.

delete from table1 where ID in (
    select min(a.ID) from (select * from table1) a group by id_x, id_y, col_z having count(*) > 1)

If I move the position of group by clause inside the alias select query, it will no longer hang.

delete from table1 where ID in (
    select a.ID from (select min(ID) from table1 group by id_x, id_y, col_z having count(*) > 1) a)

Why does it hang? Even though (select * from table1) pulls millions of records, the query doesn't seem to stop executing for hours. Can anybody explain what huddles the query? It puzzles me because the select query works fine whereas the delete query hangs.

Ryan
  • 130
  • 11
  • Not sure why it is hanging, but if you have counts > 2 you'd have to run it multiple times (as it only gets rid of the first of each set). You could accomplish it with one execute with `... ID NOT IN (SELECT MAX(a.id) ..... )` **without** the HAVING; this would delete all but the last for each set. – Uueerdo May 03 '19 at 23:00
  • Having count(*) > 1 is necessary because I am looking to delete only the duplicates. `... ID NOT IN (SELECT MAX(a.id) GROUP BY x, y, z ... )` will pull the records whether it's duplicate or not. – Ryan May 04 '19 at 02:07
  • 1
    It is supposed to pull whether there are duplicates or not ; the `NOT IN` portion means it will delete everything not in the list, the list contains the id values for the most recent entry for each "group" (in groups with only one entry, it will be the id of the only entry.) Basically, the alternative I supplied says "keep these". – Uueerdo May 06 '19 at 16:54
  • That's good point. I see how your alternative approach will work. Are you saying that the `having` clause impacts the performance? And does it impact performance more than pulling almost every IDs in the table to the `IN` clause? Genuine question.. I don't know SQL optimization very much that which one will have better performance. – Ryan May 07 '19 at 00:21
  • Also, in the above question, the hanging problem exist even when I changed to adopt your solution using `NOT IN`. I had to bring `GROUP BY` clauses into the aliasing select query `(select * from table1 group by ...)` – Ryan May 07 '19 at 00:23
  • I just figured it was worth a shot trying a slightly different angle; and figured that even if it took just as long, at least it would only need once. Your original query would have to be executed three times if you had a group with 4 members. Another alternative that can help (especially on large tables) is to create an empty copy of the table, insert-select the ones you want to keep, drop the original table, and rename the copy. – Uueerdo May 07 '19 at 18:28
  • Also, I just noticed the edit you were trying to point out. My guess as to why it helped is that in the original version the grouping was performed on the subquery, rather than directly on the table; moving the `group by` in allowed the operation to better take advantage of indexes on the table. It hung _because_ the original possible pulled the millions of rows into a hidden temp table; with the edit, it may not have had to access anything but the index. – Uueerdo May 07 '19 at 18:33

0 Answers0