0

Sometimes i use this sql query to remove duplicated posts from my posts table. But its takes a long time to finish, letting my domain unaccessible for like 5 minutes. Look like this is a heavy sql for the server, theres a way to make it lighter and faster?

DELETE FROM hotaru_posts WHERE post_id NOT IN ( SELECT a.post_id FROM ( SELECT post_title, post_id FROM hotaru_posts GROUP BY post_title ) a )

The optimal and fastest query for this is

delete from hotaru_posts where post_id in (
  select post_id from (
    select post_id from hotaru_posts a group by post_title having count(post_title) > 1
  ) b
)

This query took only 0.0603 sec. while the fisrt query took more than 5 min. (lol)

Lucas Matos
  • 1,112
  • 5
  • 25
  • 42
  • Does the current one work? It shouldn't: you are selecting a non-aggregated `post_id` in a `GROUP BY post_title`, which is incorrect. – Sergey Kalinichenko Mar 01 '12 at 20:51
  • @dasblinkenlight: yeah, but, IIRC, MySQL will just pick an undetermined value from the group in this case. – Sergio Tulentsev Mar 01 '12 at 20:53
  • @SergioTulentsev In other words, this statement deletes random duplicates :) – Sergey Kalinichenko Mar 01 '12 at 20:57
  • well it works, i got this query here on stack http://stackoverflow.com/questions/9196884/how-to-delete-a-table-row-if-the-field-post-title-is-duplicated-in-another-row – Lucas Matos Mar 01 '12 at 21:00
  • @dasblinkenlight: More specifically, from all duplicates of a post it deletes all but one. Which post survives is up to the destiny :) – Sergio Tulentsev Mar 01 '12 at 21:01
  • @sergio, yes, thats exactly what you just said, but dont matter what post will stay, i need only one, wich one doesnt matter, so thats work for my needs. I just need to make it faster. – Lucas Matos Mar 01 '12 at 21:03
  • You might want to look more carefully into how and why you have such a problem in the first place. What is the cause of all of this duplicated data and is there a reason you aren't using a unique key to prevent this? – StudyOfCrying Mar 01 '12 at 21:20

2 Answers2

1

Would this work?

DELETE FROM hotaru_posts
WHERE post_id IN (
    SELECT post_id
    FROM (
        SELECT MAX(post_id)
        FROM hotaru_posts a
        GROUP BY post_title
        HAVING COUNT(post_title) > 1
    ) b
)
Travesty3
  • 14,351
  • 6
  • 61
  • 98
  • would this work?? you tell me, im afraid to test it cause this is my live website. – Lucas Matos Mar 01 '12 at 21:09
  • 1
    @LucasMatos: copy it and test. Not having a copy of your data is a baaad thing. – Sergio Tulentsev Mar 01 '12 at 21:10
  • `CREATE TABLE testTable AS SELECT * FROM hotaru_posts` - that will duplicate your table and name it 'testTable' try running this query on that table. – Travesty3 Mar 01 '12 at 21:11
  • It does include one less `SELECT` and if it works as I think it will, the new `SELECT` will only retrieve the duplicates instead of everything that's not a duplicate. So yes, I _think_ it will be faster. – Travesty3 Mar 01 '12 at 21:15
  • @Travesty3: Error #1093 - You can't specify target table 'testTable' for update in FROM clause `DELETE FROM testTable WHERE post_id IN ( SELECT MAX( post_id ) FROM testTable GROUP BY post_title HAVING COUNT( post_title ) >1 )` – Lucas Matos Mar 01 '12 at 21:28
  • almost that @Travesty3, check the first post. – Lucas Matos Mar 02 '12 at 04:54
  • @LucasMatos: Ah, so you still needed the double `SELECT` to get around the whole 'updating and selecting from the same table at the same time' issue. Looks like it was much faster to select just the duplicates instead of the non-duplicates, though. Glad to hear it helped. I will update my answer for anyone else that stumbles upon this post as well. – Travesty3 Mar 02 '12 at 18:06
0

The optimal and fastest query for this is

delete from hotaru_posts where post_id in (
  select post_id from (
    select post_id from hotaru_posts a group by post_title having count(post_title) > 1
  ) b
)

This query took only 0.0603 sec. while the fisrt query took more than 5 min. (lol)

Lucas Matos
  • 1,112
  • 5
  • 25
  • 42