1

I'm trying to delete a lot of duplicate rows from a SQL table with businesses' codes and businesses' descriptions but I have to keep one for each entry, I have something like 1925 rows and I have 345 rows with duplicates and triple entries, this is the query I used to find duplicates and triple entries:

SELECT codice_ateco_2007, descrizione_ateco_2007, COUNT(*) AS CNT FROM codici_ateco_il_leone GROUP BY codice_ateco_2007, descrizione_ateco_2007 HAVING CNT > 1;

I tried the following but won't work, any of them, when I use CTE I get and error saying unknown function after WITH statement and when I use the other codes like

DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
SELECT MAX(ID)
FROM MyDuplicateTable
GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)

it won't work anyway it says I cannot select the table inside the in function.

Is CTE and the other code out of date or what?How can somebody fix this?By the way there also is id PRIMARY KEY in the codici_ateco_il_leone table.

  • `WITH` is only available in MySQL as of version 8.0. Upgrade if you can. – Thorsten Kettner Aug 10 '20 at 14:00
  • On a side note: Ask yourself how did the duplicates make it into the table? If you don't want duplicates, there should be a unique constraint preventing them from being inserted. – Thorsten Kettner Aug 10 '20 at 16:13
  • I know adeco codes are a long list of business types in Italy every business has got One, I imported the table from a PDF i found on their website to a ods to a csv, needles to say the table was full of duplicates and deleting duplicates with Open Office would have taken too much. Thank you. – il Leone Ami Aug 10 '20 at 19:57

2 Answers2

0

One method is row_number() with a join:

delete mdt
    from MyDuplicateTable mdt join
         (select mdt2.*,
                 row_number() over (partition by DuplicateColumn1, DuplicateColumn2, DuplicateColumn3 order by id) as seqnum
          from MyDuplicateTable mdt2
         ) mdt2
         on mdt2.id = mdt.id
    where seqnum > 1;

A similar approach uses aggregation:

delete mdt
    from MyDuplicateTable mdt join
         (select DuplicateColumn1, DuplicateColumn2, DuplicateColumn3, min(id) as min_id
          from MyDuplicateTable mdt2
          group by DuplicateColumn1, DuplicateColumn2, DuplicateColumn3
          having count(*) > 1
         ) mdt2
         using (DuplicateColumn1, DuplicateColumn2, DuplicateColumn3)
    where mdt.id > mdt2.min_id;

Both of these assume that id is a global unique identifier for each row. That seems reasonable based on the context. However, both can be tweaked if the id can be duplicated for different values of the three key columns.

Gordon Linoff
  • 1,242,037
  • 58
  • 646
  • 786
  • Dude you're amazing, I tried this one and worked just fine!: delete mdt from codici_ateco_il_leone mdt join (select codice_ateco_2007, descrizione_ateco_2007, min(id) as min_id from codici_ateco_il_leone mdt2 group by codice_ateco_2007, descrizione_ateco_2007 having count(*) > 1 ) mdt2 using (codice_ateco_2007, descrizione_ateco_2007) where mdt.id > mdt2.min_id One last question, if I might ask you, are mdt and mdt2 like two temporary tables for the join? – il Leone Ami Aug 10 '20 at 14:11
0

Your delete statement is fine and works in about every DBMS - except for MySQL where you get this stupid error. The solution to this is simple: replace from sometable with from (select * from sometable) somealias:

DELETE
FROM MyDuplicateTable
WHERE ID NOT IN
(
  SELECT MAX(ID)
  FROM (SELECT * FROM MyDuplicateTable) t
  GROUP BY DuplicateColumn1, DuplicateColumn2, DuplicateColumn3
);
Thorsten Kettner
  • 89,309
  • 7
  • 49
  • 73