Background: I have a large database of people, and I want to look for duplicates, which is more difficult than it seems. I already do a lot of comparison between the names (which are often spelled in different ways), dates of birth and so on. When two profiles appear to be similar enough to the matching algorithm, they are presented to an operator who will judge.
Most profiles have more than one phone number attached, so I would like to use them to find duplicates. They can be entered as "001-555-123456", but also as "555-123456", "555-123456-7-8", "555-123456 call me in the evening" or anything you might imagine. My first idea is to strip all non-numeric characters and get the "longest common substring". There are a lot of algorithms around to find the longest common substring inside a set. But whenever I compare two profiles A and B, I have two sets of phone numbers. I would like to find the longest common substring between a string in the set A and a string in a set B. Can you please help me in finding such an algorithm? I normally program in PHP, a SQL-only solution would be even better, but any other language would go.