Deduping database records comparing values in numerous fields

Question

So I'm trying to clean some phone records in a database table.

I've found out how to find exact matches in 2 fields using:

/* DUPLICATE first & last names */

SELECT 
    `First Name`, 
    `Last Name`, 
     COUNT(*) c 
FROM phone.contacts  
GROUP BY 
    `Last Name`, 
    `First Name` 
HAVING c > 1;

Wow, great.

I want to expand it further to look at numerous fields to see if a phone number in 1 of 3 phone fields is a duplicate.

So I want to check 3 fields (general mobile, general phone, business phone).

1.to see that they are not empty ('') 2.to see if the data (number) in any of them appears in the other 2 phone fields anywhere in the table.

So pushing my limited SQL past its limit I came up with the following which seems to return records with 3 empty phone fields & also records that don't have duplicate phone numbers.

/* DUPLICATE general & business phone nos */

SELECT 
    id, 
   `first name`, 
   `last name`, 
   `general mobile`, 
   `general phone`, 
   `general email`, 
   `business phone`, 
    COUNT(CASE WHEN `general mobile` <> '' THEN 1 ELSE NULL END) as gen_mob, 
    COUNT(CASE WHEN `general phone` <> '' THEN 1 ELSE NULL END) as gen_phone,
    COUNT(CASE WHEN `business phone` <> '' THEN 1 ELSE NULL END) as bus_phone 
FROM phone.contacts 
GROUP BY 
   `general mobile`, 
   `general phone`, 
   `business phone` 
HAVING gen_mob > 1 OR gen_phone > 1 OR bus_phone > 1;

Clearly my logic is flawed & I wondered if someone could point me in the right direction/take pity etc...

Many thanks

please tag which DBMS are you using – Luis LL Aug 15 '13 at 10:27 — Luis LL, Aug 15 '13 at 10:27

Bohemian · Accepted Answer · 2014-03-27T11:50:42.603

5

The first thing you should do shoot the person that named your columns with spaces in them.

Now then, try this:

SELECT DISTINCT
   c.id, 
   c.`first name`, 
   c.`last name`, 
   c.`general mobile`, 
   c.`general phone`, 
   c.`business phone`
from contacts_test c
join contacts_test c2
    on (c.`general mobile`!= '' and c.`general mobile` in (c2.`general phone`, c2.`business phone`))
    or (c.`general phone` != '' and c.`general phone` in (c2.`general mobile`, c2.`business phone`))
    or (c.`business phone`!= '' and c.`business phone` in (c2.`general mobile`, c2.`general phone`))

See a live demo of this query in SQLFiddle.

Note the extra check for phone != '', which is required because the phone numbers are not nullable, so their "unknown" value is blank. Without this check, false matches are returned because of course blank equals blank.

The DISTINCT keyword was added in case there are multiple other rows that match, which would result in a nxn result set.

edited Mar 27 '14 at 11:50

answered Aug 15 '13 at 11:36

Bohemian

412,405
93
575
722

Thanks for that. Totally agree about the column names. I assumed your script would simply join records to the selection that match the values in general mobile, general phone & business phone however what actually happened was a never-ending loop of records exactly the same as the initial selection. Any ideas? – Still_Learning Aug 16 '13 at 12:10
I don't understand what you mean. Please create an [SQLFiddle](http://www.sqlfiddle.com) including data to show the problem – Bohemian Aug 16 '13 at 13:07
See [SQLFiddle data](http://www.sqlfiddle.com/#!2/a3e29/6) Ok so the test data has 3 records, 2 of which contain a duplicate number. I run the above code & it returns 9 rows. Presumably the join statement is not checking that there is a match before joining records Thanks in advance – Still_Learning Aug 17 '13 at 13:26
I can see the problem, and I can fix it, but I must ask you, why are you storing blanks instead of nulls for "no number"? The readon I ask is that it makes the query quite different. Before I write the fix, are you intending to use nulls, blanks or either for missing numbers? – Bohemian Aug 18 '13 at 00:11
I'll use nulls if that makes things easier. I simply inherited the current situation. Thanks – Still_Learning Aug 18 '13 at 16:33
It would have been easier because null is not equal to null, so the extra check (see edited answer and notes) would not have been required, but it's OK. All fixed. – Bohemian Aug 18 '13 at 21:41
Fantastic, that seems to be the ticket. Much appreciated. – Still_Learning Aug 20 '13 at 11:59

score 1 · Answer 2 · answered Aug 15 '13 at 11:51

In my experience, when cleaning up data, it's much better to have a comprehending view of the data, and a simple way to manage it, than to have a big and bulky query that does all the analysis at once.

You can also, (more-or-less) renormalize the database, using something like:

Create view VContactsWithPhones
as
Select id, 
       `Last Name` as LastName, 
       `First Name` as FirstName,
       `General Mobile` as Phone,
       'General Mobile' as PhoneType
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `General Phone`,
       'General Phone'
From phone.contacts c
UNION
Select id, 
       `Last Name`, 
       `First Name`,
       `Business Phone`,
       'Business Phone'
From phone.contacts c

This will generate a view with triple the rows of the original table, but with a Phone column, that can be of one of three types.

You can than easily select from that view:

//empty phones
SELECT * 
FROM VContactsWithPhones 
Where Phone is null or Phone = ''

//duplicate phones
Select Phone, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone
having count(*) > 1

//duplicate phones belonging to the same ID (double entries)
Select Phone, ID, Count(*)
from VContactsWithPhones 
where (Phone is not null and Phone <> '')  -- exclude empty values
group by Phone, ID
having count(*) > 1

//duplicate phones belonging to the different ID (duplicate entries)
Select v1.Phone, v1.ID, v1.PhoneType, v2.ID, v2.PhoneType
from VContactsWithPhones v1
   inner join VContactsWithPhones v2 
     on v1.Phone=v2.Phone and v1.ID=v2.ID
where v1.Phone is not null and v1.Phone <> ''

etc, etc...

score 0 · Answer 3 · answered Aug 15 '13 at 11:16

You can try something like:

SELECT * from phone.contacts p WHERE `general mobile` IN (SELECT `general mobile` FROM phone.contacts WHERE id != p.id UNION SELECT `general phone` FROM phone.contacts WHERE id != p.id UNION SELECT `general email` FROM phone.contacts WHERE id != p.id)

Repeat 3 times for each: general mobile, general phone and general email. It can be put in a single query but would be less readable.

Deduping database records comparing values in numerous fields

3 Answers3