1

I've got an old forum which contains threads with duplicate first posts (perhaps differing replies). I want to delete all but one of these threads (leaving the thread with the highest view count).

I have the following SQL query to help identify duplicate threads, but I can't find a way for it to list only duplicates with the lowest value for the xf_thread.view_count column:

SELECT
    t.thread_id, MIN(t.view_count)
FROM
    xf_thread t
INNER JOIN
    xf_post p ON p.thread_id = t.thread_id
WHERE
    t.first_post_id = p.post_id
GROUP BY
    t.title,
    t.username,
    p.message
HAVING
    COUNT(t.title) > 1
    AND COUNT(t.username) > 1
    AND COUNT(p.message) > 1;

At the moment, this query correctly groups threads, but it only shows a random thread_id - rather than the thread_id corresponding to min(view_count).

I've read up on how to work around this, but I can't figure out how to achieve this - as it doesn't look like it's possible to order the row in group by queries.

edit

Thanks to the help from Madhur, the query now returns all the of the thread IDs to be deleted. However, I can figure out how to delete the rows with matching thread_id's. Here's the query I tried to use (it just keeps running, whereas the select query (https://stackoverflow.com/a/52314208/2469308) runs in seconds:

DELETE FROM xf_thread 
WHERE  thread_id IN (SELECT Substring_index(Group_concat(DISTINCT t.thread_id 
                                            ORDER BY 
                                            t.view_count 
                                                                ASC 
                                            SEPARATOR ','), 
                                   ',', 1) AS 
                            thread_id_with_minimum_views 
                     FROM   (SELECT * 
                             FROM   xf_thread) t 
                            INNER JOIN xf_post p 
                                    ON p.thread_id = t.thread_id 
                     WHERE  t.first_post_id = p.post_id 
                            AND t.user_id = 0 
                            AND t.reply_count < 2 
                     GROUP  BY t.title, 
                               t.username, 
                               p.message 
                     HAVING Count(t.title) > 1 
                            AND Count(t.username) > 1 
                            AND Count(p.message) > 1 
                     ORDER  BY t.thread_id); 
IanCun
  • 47
  • 6
  • May you please post the sample data and expected result also? – Ankit Bajpai Sep 13 '18 at 12:46
  • check https://stackoverflow.com/a/52314208/2469308 Let me know if it works, although a hacky approach, but should work! In MySQL version >= **8.0** , you could have achieved this easily using Window Functions – Madhur Bhaiya Sep 13 '18 at 12:56

1 Answers1

1

A very hacky solution is to get the thread_id sorted by view_count in a GROUP_CONCAT. Then, we can utilize string operations to get the thread_id with minimum view_count.

In your SELECT clause, instead of t.thread_id, you can try the following:

SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT t.thread_id 
                             ORDER BY t.view_count ASC 
                             SEPARATOR ','), 
                ',', 
                1) AS thread_id_with_minimum_views

Now, based on the SELECT query to identify Duplicate records with Minimum view, DELETE query to delete such records from the xf_thread table will be as follows:

DELETE t_delete FROM xf_thread AS t_delete 
INNER JOIN (SELECT CAST(SUBSTRING_INDEX(GROUP_CONCAT(DISTINCT t.thread_id ORDER BY t.view_count ASC SEPARATOR ','), ',', 1) AS UNSIGNED) AS tid_min_view 
            FROM (SELECT * FROM xf_thread) t 
            INNER JOIN xf_post p ON p.thread_id = t.thread_id 
            WHERE t.first_post_id = p.post_id 
              AND t.user_id = 0 
              AND t.reply_count < 2 
            GROUP BY t.title, t.username, p.message 
            HAVING Count(t.title) > 1 
               AND Count(t.username) > 1 
               AND Count(p.message) > 1 
            ORDER BY t.thread_id) AS t_dup 
  ON t_delete.thread_id = t_dup.tid_min_view 
Madhur Bhaiya
  • 28,155
  • 10
  • 49
  • 57
  • It may be hacky, but this looks like it works great! Many thanks. Is there a way I can modify the query to delete the rows, rather than select them? I tried deleting using "DELETE FROM xf_thread WHERE thread_id IN" and then the rest as a sub-query, but that is really really really slow. – IanCun Sep 13 '18 at 13:58
  • What is your delete query ? You can edit the question and add a subpart to it. – Madhur Bhaiya Sep 13 '18 at 13:59
  • Thanks, I've edited the question to include the delete query I tried. – IanCun Sep 13 '18 at 14:09
  • @IanCun subquery will be very slow. Hint: Look at multi-table join delete. Check for eg: https://stackoverflow.com/questions/4154707/delete-from-one-table-with-join Try to figure out. If it still doesn't work, let me know! – Madhur Bhaiya Sep 13 '18 at 14:26
  • Many thanks @Madhur Bhaiya, I've been trying the join delete, but I can't figure out how to get it working with this as a join delete, at least not without using a sub-query. If you've got any tips, I'd really appreciate it, as I'm stuck! The subquery method has been running on my test db for over an hour and hasn't done anything yet, so it must be so much slower! Good to know for future. – IanCun Sep 13 '18 at 15:39
  • 1
    @IanCun Try the following query: `DELETE t_delete FROM xf_thread AS t_delete INNER JOIN (SELECT CAST(SUBSTRING_INDEX(Group_concat(DISTINCT t.thread_id ORDER BY t.view_count ASC SEPARATOR ','), ',', 1) AS UNSIGNED) AS tid_min_view FROM (SELECT * FROM xf_thread) t INNER JOIN xf_post p ON p.thread_id = t.thread_id WHERE t.first_post_id = p.post_id AND t.user_id = 0 AND t.reply_count < 2 GROUP BY t.title, t.username, p.message HAVING Count(t.title) > 1 AND Count(t.username) > 1 AND Count(p.message) > 1 ORDER BY t.thread_id) AS t_dup ON t_delete.thread_id = t_dup.tid_min_view` – Madhur Bhaiya Sep 13 '18 at 18:12
  • @IanCun delete query was fast ? Problem with previous query was that the subquery was performing full select to find duplicate id(s), for **each and every row** – Madhur Bhaiya Sep 13 '18 at 18:20
  • Yes, very fast thank you - much much faster than subselect. – IanCun Sep 13 '18 at 18:32