Find common substrings from 2 separate sets

Question

Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.

Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine. My first idea is to strip all non-numeric characters and get the "longest common substring". There are a lot of algorithms around to find the longest common substring inside a set. But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B. Can you please help me in finding such an algorithm? I normally program in PHP, a SQL-only solution would be even better, but any other language would go.

That's not an answer to your question, but you may begin with replacing all national-encoding characters like `ą` and `ö` with their ASCII substitutes and remove all characters that are not necessary, like spaces, `-`, `+`, etc. — Voitcus, May 16 '13 at 10:28
Take also a look at the related questions on the right, eg. the first one http://stackoverflow.com/questions/336605/how-can-i-find-the-largest-common-substring-between-two-strings-in-php?rq=1 — Voitcus, May 16 '13 at 10:29
I would actually strip all the non-numeric, since for telephone numbers they are the important part. — carlo.borreo, May 16 '13 at 10:29
For telephone numbers if they all are the same length it should not be much effort. — Voitcus, May 16 '13 at 10:32
I guess you will need to create something like `full outer join` between A and B profiles. Then compare all candidates and get the longest common susbtring. — Tomas Greif, May 16 '13 at 10:34
Voitcus: yes, I do that, thanks But the phone numbers are from all over the world, in different formats — carlo.borreo, May 16 '13 at 10:34
You don't need to compare numbers from the USA (+1), Germany (+49), Poland (+48) and so on, as they are obviously different. — Voitcus, May 16 '13 at 10:40
Yes, but, strange as it may seem, I don't know in advance where the number comes from. The country code may be present, but not always. — carlo.borreo, May 16 '13 at 10:50

score 1 · Accepted Answer · answered May 16 '13 at 12:19

As Voitcus said before, you have to clean your data first before you start comparing or looking for duplicates. A phone number should follow a strict pattern. For the numbers which do not match the pattern try to adjust them to it. Then you have the ability to look for duplicates.

Morevover you should do data-cleaning before persisting it, maybe in a seperate column. You then dont have to care for that when looking for duplicates ... just to avoid performance peaks.

Algorithms like levenshtein or similar_text() in php, doesnt fit to that use-case quite well.

score 0 · Answer 2 · answered May 16 '13 at 12:05

In my opinion the best way is to strip all non-numeric characters from the texts containing phone numbers. You can do this in many ways, some regular expression would be the best, but see below.

Then, if it is possible, you can find the country direction code, if the user has its location country. If there is none, assume default and add to the string. The same would be probably with the cities. You can try to take a look also in place one lives, their zip code etc.

At the end of this you should have uniform phone numbers which can be easily compared.

The other way is to compare strings with the country (and city) code removed.

About searching "the longest common substring": The numbers thus filtered are the same, however you might need it eg. if someone typed "call me after 6 p.m.". If you're sure that the phone number is always at the beginning, so nobody typed something like 555-SUPERMAN which translates to 555-78737626, there is also possibility to remove everything after the last alphanumeric character (and this character, as well).

There is also a possibility to filter such data in the SQL statement. Consider something like a SELECT ..., [your trimming function(phone_number)] AS trimmed_phone WHERE (trimmed_phone is not numerical characters only) GROUP BY trimmed_phone. If trimming function would remove only whitespaces and special dividers like -, +, . (commonly in use in Germany), , perhaps etc., this query would leave you all phone numbers that are trimmed but contain characters not numeric -- take a look at the results, probably mostly digits and letters. How many of them are they? Maybe they have something common? Maybe some typical phrases you can filter out too?

If the result from such query would not be very much, maybe it's easier just to do it by hand?

Find common substrings from 2 separate sets

2 Answers2