0

I've not seen anyone ask for help with this specific issue.

I've got a table with 300,000 rows in it. Each row has a unique id, several columns but there is no timestamp etc.

The issue I have is the user has managed to import new data in to the DB and so now some of the rows are duplicated.

For the rows having this issue there are 2 rows which are identical apart from the ID.

Is there any way to search the whole table, find the duplicated rows based on name and remove the rows with the old ID ?

I need to ensure only a duplicate is removed and only the old entry.

So far I've come up with the following which shows the duplicate rows.

SELECT id, name, COUNT(name) AS cnt
FROM Sites
GROUP BY name
HAVING (cnt > 1)

This produces output with id, name, cnt and shows there are 50,000 entries to be removed. The id shown does appear to be the old ID.

is there anyway to feed this into a delete command to remove the entries ?

Thanks

The Impaler
  • 45,731
  • 9
  • 39
  • 76
Tom
  • 1,436
  • 24
  • 50
  • What version of MySQL? 5.x or 8.x? – The Impaler Oct 07 '20 at 13:26
  • 3
    Does this answer your question? [Delete all Duplicate Rows except for One in MySQL?](https://stackoverflow.com/questions/4685173/delete-all-duplicate-rows-except-for-one-in-mysql) (see accepted answer). – Akina Oct 07 '20 at 13:27
  • *find the duplicated rows based on name and remove the rows with the old ID ?* As you have said "there is no timestamp" - so you cannot tell about "old". You may tell, for example, "remove the rows with lower `id`". – Akina Oct 07 '20 at 13:30
  • @Akina - you are correct. I've updated my original ticket and I need to remove the duplicate row with the lower id. – Tom Oct 07 '20 at 13:49
  • @TheImpaler `10.2.29-MariaDB` – Tom Oct 07 '20 at 13:49

3 Answers3

1

As far as I understand, there's two duplicate rows now in table and you want to delete the old one or the one with smaller id.

You can INNER JOIN the same table.

First, confirm all rows to delete:

SELECT t1.* FROM Sites t1 
    INNER JOIN Sites t2 
    WHERE t1.name = t2.name 
    AND t1.id < t2.id

This should only return original rows with smaller ID

Second, if all returned rows were correct, now you can use that query and get each ID and use this in DELETE statement

DELETE * FROM Sites WHERE id IN (
    SELECT t1.id FROM Sites t1 
        INNER JOIN Sites t2 
        WHERE t1.name = t2.name
        AND t1.id < t2.id
)

You can add more columns with AND from your table to check exact duplicate rows.

  • The first query is running but nothing is returned. Actually its still running several minutes later. My example to find the 50k duplicates take 3 seconds. is there any way to use that with the Delete ? – Tom Oct 07 '20 at 14:35
1

This seems to have worked for me.

DELETE FROM Sites WHERE id IN (
 SELECT * FROM (
  SELECT id FROM Sites GROUP BY name HAVING COUNT(name) >= 2
 ) AS a
);

Thanks

Tom
  • 1,436
  • 24
  • 50
0

Assuming the old ID is a lower value, you can use FIRST_VALUE partitioned by name (not being familiar with your table) and ordered by ID.

https://mariadb.com/kb/en/first_value/

3C41DC
  • 63
  • 7