Get Duplicates efficiently

Question

I'm really stuck at the moment, as my query works, but only with some sample data, in the actual database with 15'000 clients I don't even get a response in time.

I have the following table:

-Client-
id
firstname
lastname

Now my query has to get all possible duplicates and list them, so lets say we have Client A with id 1, B with id 2 and C with id = 3 that have the same first- and lastname. The output should look like this:

id | duplicateID | client | duplicate
1  | 2           | A      | B
1  | 3           | A      | C
2  | 3           | B      | C

My query looks like this:

SELECT
  c.id AS clientID,
  d.id AS duplicateID,
  CONCAT(c.firstname, ' ', c.lastname) AS fullName
FROM Client AS c
JOIN Client AS d
  ON d.lastname = c.lastname
  AND d.firstname = c.firstname
  AND d.id != c.id
  AND d.id > c.id
ORDER BY fullName, c.id

Is there any way to increase the performance without losing any results? I looked into this answer, but there I only get one duplicate per client, where I want all of the duplicates.

Any help or tip is appreciated, thanks

Edit: SQL Fiddle as requested

can u give sqlfiddle example ? and wished result ? – echo_Me May 20 '14 at 10:52 — echo_Me, May 20 '14 at 10:52
Please provide proper DDLs – Strawberry May 20 '14 at 11:02 — Strawberry, May 20 '14 at 11:02
from that fiddle you gave , what result you want get ? – echo_Me May 20 '14 at 11:11 — echo_Me, May 20 '14 at 11:11

score 1 · Accepted Answer · answered May 20 '14 at 11:17

This is your query every so slightly simplified (one on condition is removed).

SELECT c.id AS clientID, d.id AS duplicateID, CONCAT(c.firstname, ' ', c.lastname) AS fullName
FROM Client c JOIN
     Client d
     ON d.lastname = c.lastname AND d.firstname = c.firstname AND d.id > c.id
ORDER BY fullName, c.id;

Try using an index.

create index client_lastname_firstname_id on client(lastname, firstname, id);

This should facilitate the join. If you have a lot of data, the order by could become a performance bottleneck.

Another solution would be to put all the duplicates on one line. This only involves a group by, and gives the list of duplicate ids for each name:

SELECT CONCAT(c.firstname, ' ', c.lastname) AS fullName,
       group_concat(c.id order by c.id) AS clientIDs
 FROM Client c 
GROUP BY c.firstname, c.lastname;

Wow, thanks so much! Only adding the index reduced the runtime from 58 seconds to 0.09 seconds! — Michael Kunst, May 20 '14 at 11:43
@MichaelKunst . . . Wow! A 99.8% improvement. That's pretty good ;) — Gordon Linoff, May 20 '14 at 11:44

Get Duplicates efficiently

1 Answers1