-1

I have searched numerous answers in here stackoverflow but I can't seem to understand how to delete duplicate rows. In my case, delete rows with the same pd in a td.

I have a table here with a name questions. Let pd be problem_id, td be task_id (just to make the naming short)

id    score  pd    td
'1',  '10', '673', '2663'
'2',  '10', '674', '2663'
'3',  '10', '675', '2663'
'4',  '10', '676', '2663'
'5',  '10', '677', '2663'
'6',  '10', '678', '2663'
'7',  '10', '674', '2663'
'8',  '10', '675', '2663'
'9',  '10', '676', '2663'
'10', '10', '677', '2663'
'11', '10', '710', '2700'
'12', '10', '711', '2700'
'13', '10', '710', '2700'
'14', '10', '711', '2700'

How to do an sql script that will result to this:

id    score  pd    td
'1',  '10', '673', '2663'
'2',  '10', '674', '2663'
'3',  '10', '675', '2663'
'4',  '10', '676', '2663'
'5',  '10', '677', '2663'
'6',  '10', '678', '2663'
'11', '10', '710', '2700'
'12', '10', '711', '2700'

I'm still not an expert in SQL and I have this code that I based from this answer https://stackoverflow.com/a/2594879/7362231 but it seems that it is giving me an error.

DELETE d
    FROM staging_db.questions as d
        INNER JOIN (SELECT
                        y.id, y.problem_id, y.task_id, ROW_NUMBER() OVER(PARTITION BY y.problem_id,y.task_id ORDER BY y.problem_id,y.task_id,y.id) AS RowRank
                        FROM staging_db.questions as y
                            INNER JOIN (SELECT
                                            problem_id,task_id, COUNT(*) AS CountOf
                                            FROM staging_db.questions
                                            GROUP BY problem_id,task_id
                                            HAVING COUNT(*)>1
                                        ) as dt ON y.problem_id=dt.problem_id AND y.task_id=dt.task_id
                   ) dt2 ON d.id=dt2.id
        WHERE dt2.RowRank!=1
        SELECT * FROM staging_db.questions;
John Cymmer
  • 195
  • 1
  • 9

3 Answers3

2

Try the below -

DELETE FROM staging_db.questions
   WHERE id not IN (SELECT * 
                    FROM (SELECT min(n.id)
                            FROM staging_db.questions n
                        GROUP BY pd,td) x)
Fahmi
  • 37,315
  • 5
  • 22
  • 31
  • 2
    @JohnCymmer . . . This is a hack and I would strongly discourage you from using constructs that the documentation *explicitly* warns against using: https://dev.mysql.com/doc/refman/8.0/en/delete.html. – Gordon Linoff Aug 29 '20 at 12:22
1

The proper way to write this in MySQL is to use a join:

DELETE
    FROM staging_db.questions q JOIN
         (SELECT q2.pd, q2.td, MIN(q2.id) as min_id
          FROM staging_db.questions q2
          GROUP BY q2.pd, q2.td
         ) q2
         USING (pd, td)
   WHERE q.id > q2.min_id;

I want to point out the MySQL documentation explicitly states:

Subqueries

You cannot delete from a table and select from the same table in a subquery.

Often, though, deleting lots of rows in a table is inefficient, and it is more efficient to re-populate the table:

create table temp_q as
    select id, score pd, td, 
    from (select q.*,
                 row_number() over (partition by pd, td order by id) as seqnum
          from staging_db.questions q
         ) q
    where seqnum = 1;

truncate table staging_db.questions;

insert into staging_db.questions (id, score pd, td)
    select id, score pd, td
    from temp_q;
Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • 2
    For anything mission critical, or pretty much production instances in general, this is the answer to use. You don’t want any connectivity or server errors at inopportune moments to keep you awake at night. You also want to finish running a SELECT operation on that table object before commencing a DELETE operation on the same object. – ed2 Aug 29 '20 at 13:18
  • I think this is a better answer than Fahmi's especially the second block of code. Thank you! – John Cymmer Aug 30 '20 at 23:28
  • @JohnCymmer are you really convinced that this is a better answer? The 1st query uses aggregation and then a join and what to say about the 2nd? Do you really think that truncating a table every time that you want to delete duplicates is the correct way? This always depends on the number of expected duplicates. – forpas Aug 31 '20 at 07:57
1

The simplest way to do it without the need of any unnecessary aggregation is with a self join:

delete q1
from questions q1 inner join questions q2
on q2.pd = q1.pd and q2.td = q1.td and q2.id < q1.id;

See the demo.
Results:

| id  | score | pd  | td   |
| --- | ----- | --- | ---- |
| 1   | 10    | 673 | 2663 |
| 2   | 10    | 674 | 2663 |
| 3   | 10    | 675 | 2663 |
| 4   | 10    | 676 | 2663 |
| 5   | 10    | 677 | 2663 |
| 6   | 10    | 678 | 2663 |
| 11  | 10    | 710 | 2700 |
| 12  | 10    | 711 | 2700 |
forpas
  • 160,666
  • 10
  • 38
  • 76