
I want to delete all duplicate rows from a MySQL table, but the problem is that I do not know which rows are duplicates.
The table contains roughly 500,000 rows, some of which are duplicates.
Please guide me on how to do this.

UPDATE:

I need an SQL query that I can run in phpMyAdmin.
Here is a rough table to illustrate.
Suppose the table name is foo.

+----+----------+--------------------+--------------------+
| id | link     | title              | description        |
+----+----------+--------------------+--------------------+
| 1  | google   | search engine      | search here free   |
| 2  | yahoo    | also search engine | findout web easily |
| 3  | Facebook | connect with world | meet with world    |
| 4  | google   | search engine      | search here free   |
| 5  | msn      | Microsoft network  | network by MS      |
| 6  | google   | search engine      | search here free   |
| 7  | msn      | Microsoft network  | network by MS      |
| 8  | yahoo    | also search engine | findout web easily |
| 9  | myweb    | my website         | ideal website      |
| .. | ...      | ...                | ...                |
+----+----------+--------------------+--------------------+

This is a rough table; I cannot show my table in full because it has roughly 500,000 rows. I hope this makes clear what I want.
A rough query would be something like:

DELETE all duplicate rows FROM foo

EDIT
I see this question has been marked as a duplicate, but I think it is unique. I looked at the question it is compared with, and the answer marked as useful there works for small tables: it changes the indexes by adding a unique index. Here is the code from that answer:

  ALTER IGNORE TABLE jobs ADD UNIQUE INDEX idx_name (site_id, title, company );  

If you run this query, then whenever you add any data MySQL checks whether it is already present and, if it is, refuses to add it.
I already told you that my table contains a lot of data; if I use this query and then add one record, MySQL has to check my whole table of roughly 500,000 rows to enforce uniqueness, which makes the insert slow. And that is for only one new record; if I want to enter 10, or 100,000, new records it becomes far too slow.
And most of the other answers I have seen contain a HAVING clause, which is already slow.

Axeem

2 Answers

7

You can probably do it using a JOIN in a DELETE, joining against a subselect.

More details are required to give much help, but here is a rough idea:

DELETE result
FROM result
INNER JOIN (SELECT SomeField, COUNT(*) AS RecCount, MAX(DateAdded) AS MaxDateAdded FROM result GROUP BY SomeField) b
ON result.SomeField = b.SomeField AND result.DateAdded != b.MaxDateAdded

This finds every occurrence of SomeField with its corresponding max date added, and deletes any row where there isn't a match on that max date added.

I assume that you want to keep the latest record.

Note that mass deletes like this are a bit worrying, given that if you get it wrong you potentially delete all your records.

EDIT - version to go with the table you have now given. This will delete the duplicates, leaving you with just the first row of each identical set (i.e. for google you are left with just the row with id 1):

DELETE foo 
FROM foo
INNER JOIN (SELECT link, title, description, MIN(id) AS MinId FROM foo GROUP BY link, title, description ) b
ON foo.link = b.link
AND foo.title = b.title 
AND foo.description = b.description
AND foo.id != b.MinId
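
Given the warning above about mass deletes, you can run the same join as a SELECT first to preview exactly which rows would be removed. A minimal sketch, using the same foo table and join condition as the DELETE:

-- preview only: the rows returned here are the ones the DELETE above would remove
SELECT foo.id, foo.link, foo.title, foo.description
FROM foo
INNER JOIN (SELECT link, title, description, MIN(id) AS MinId FROM foo GROUP BY link, title, description) b
ON foo.link = b.link
AND foo.title = b.title
AND foo.description = b.description
AND foo.id != b.MinId

If the rows returned are the ones you expect to lose (ids 4, 6, 7 and 8 in the sample data above), swap the SELECT back to the DELETE.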
Kickstart
  • Sorry, typos now fixed (I hope!) – Kickstart Apr 16 '13 at 10:43
  • Thanks, it works well, but I changed your code like this: `DELETE foo FROM foo INNER JOIN (SELECT link, MIN(id) AS MinId FROM foo GROUP BY link) b ON foo.link = b.link AND foo.id != b.MinId`. I think this is faster. – Axeem Apr 16 '13 at 10:53
  • That may be a touch faster, but if (for example) Google had entries with different titles or descriptions then they would be treated as 1. This may or may not be an issue. – Kickstart Apr 16 '13 at 11:07
  • 2
    I also sped this query up by adding a HAVING clause on the nested select, so the INNER JOIN becomes `(SELECT link, title, description, MIN(id) AS MinId FROM foo GROUP BY link, title, description HAVING COUNT(*) > 1)`, thus removing all non-duplicates in the subquery so the join does not have to process them. – Ben Holness Mar 11 '14 at 16:50
0

Using a scripting language such as Perl or Python:

1) Do a query to load each row.

2) Calculate a checksum over the fields that should be unique. For example, if each (name, cost) pair should only have one entry, calculate the checksum on those fields. A hashing algorithm like MD5 would be ideal for this (see the sketch after this list).

3) Store each checksum with the "id", or some other way of identifying the row for a later delete.

4) Have a system of rules for resolving which one of a set of non-unique records is kept.

5) Once the query processing is complete and all the duplicates have been found, use the rules and the set of ids/checksums to generate a SQL DELETE command (or a set of DELETE commands).

6) Execute the DELETE command(s).

7) Alter your database structure to add a unique constraint so this cannot happen again.

The table only has 500,000 rows, so storing the id information and checksums is well within the capabilities of most scripting languages.
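
A minimal sketch of the checksum idea expressed in SQL, assuming the foo table from the question; the index name idx_foo_unique and the ids in the final DELETE are illustrative (they match the duplicates in the question's sample data), and in practice the script would group the returned ids by checksum and apply its rules before deleting:

-- steps 2 and 3: one MD5 checksum per row over the columns that should be unique
SELECT id, MD5(CONCAT_WS('|', link, title, description)) AS checksum
FROM foo;

-- steps 5 and 6: the script then deletes the losing ids of each duplicate set
DELETE FROM foo WHERE id IN (4, 6, 7, 8);

-- step 7: a unique constraint so the duplicates cannot come back
-- (assumes the columns are short enough to be indexed together)
ALTER TABLE foo ADD UNIQUE INDEX idx_foo_unique (link, title, description);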

Vorsprung