I'm working on some stuff for an in-house CRM. The company's current frontend allows for lots of duplicates. I'm trying to stop end-users from putting in the same person because they searched for 'Bill Johnson' and not 'William Johnson.' So the user will put in some information about their new customer and we'll find the similar names (including fuzzy names) and match them against what is already in our database and ask if they meant those things... Does such a database or technology exist?
-
Haha thanks BenV, I didn't know there was a fuzzy-search tag :) – Jon Phenow Jul 20 '10 at 13:25
-
This sounds like the sort of problem at which you need to throw every trick you can find. Consider buying yourself a few baby name books :P International versions, too. – detly Jul 20 '10 at 13:46
-
Haha like I said - time to open source a new github project :) I wonder if there'd be a way to do it that both PHP and ASP.NET could take advantage of what I do - I don't know if I want to do this in js though.... – Jon Phenow Jul 20 '10 at 13:58
-
Well this has easily become the most helpful post I've done :) Thanks everyone! Everyone posted very helpful links and suggestions – Jon Phenow Jul 20 '10 at 14:46
6 Answers
I implemented such a functionality on one website. I use double_metaphone() + levenstein() in PHP. I precalculate a double_metaphone() for each entry in the dabatase, which I lookup using a SELECT of the first x chars of the 'metaphoned' searched term.
Then I sort the returned result according to their levenstein distance. double_metaphone() is not part of any PHP library (last time I checked), so I borrowed a PHP implementation I found somewhere a long while ago on the net (site no longer on line). I should post it somewhere I suppose.
EDIT: The website is still in archive.org: http://web.archive.org/web/20080728063208/http://swoodbridge.com/DoubleMetaPhone/
or Google cache: http://webcache.googleusercontent.com/search?q=cache:Tr9taWl9hMIJ:swoodbridge.com/DoubleMetaPhone/+Stephen+Woodbridge+double_metaphon
which leads to many other useful links with source code for double_metaphone(), including one in Javascript on github: http://github.com/maritz/js-double-metaphone
EDIT: Went through my old code, and here are roughly the steps of what I do, pseudo coded to keep it clear:
1) Precompute a double_metaphone() for every word in the database, i.e., $word='blahblah'; $soundslike=double_metaphone($word);
2) At lookup time, $word is fuzzy-searched against the database: $soundslike = double_metaphone($word)
4) SELECT * FROM table
WHERE soundlike
LIKE $soundlike (if you have levenstein stored as a procedure, much better: SELECT * FROM table WHERE levenstein(soundlike
,$soundlike) < mythreshold ORDER BY levenstein(word
,$word) ASC LIMIT ... etc.
It has worked well for me, although I can't use a stored procedure, since I have no control over the server and it's using MySQL 4.20 or something.

- 3,582
- 1
- 20
- 19
-
-
Added URLs to more information re. various double_metaphone() implementation. Now I remember better: I borrowed Stephen Woodbridge's PHP implementation, and modified it slightly for my need (removed the limit in order to double_metaphone the whole term, not just the first fourth characters, etc.) – R. Hill Jul 20 '10 at 14:22
-
Would be curious to hear how this idea played out - judging by the link to github, not so much. But the use case remains and even if its fuzzy it sounds like this approach had promise. Or is it just that Lucene's that much better? – justSteve Mar 10 '12 at 21:53
I asked a similar question once. Name Hypocorism List I never did get around to doing anything with it but the problem has come up again at work so I might write and open source a library in .net for doing some matching.
Update: I ported the perl module I mentioned there to C# and put it up on github. http://github.com/stimms/Nicknames
-
Yes I was considering doing an open/php/mysql alternative cause that's my thing :) but for this specific project I'd love to find a .net matcher with say nicknames and so forth – Jon Phenow Jul 20 '10 at 13:30
-
That bit of perl could be useful for adding to a new open project for sure – Jon Phenow Jul 20 '10 at 13:39
Implement the Levenshtein distance:
http://en.wikipedia.org/wiki/Levenshtein_distance
This can be written as a SQL Function and queried many different ways.

- 38,138
- 7
- 87
- 101
-
A+ for cool computer science ideals. As I said in another comment about hoping to open up a mysql names database I should like to include something to this effect. My only worry is that this will in many cases check for distance of error, when in fact some nicknames hardly look like their origin, but again, quite helpful nontheless – Jon Phenow Jul 20 '10 at 13:33
-
Mmh pairs like Jim and James are quite different, not sure it's an appropriate measure. – Mau Jul 20 '10 at 13:33
-
yes, I think it could catch a good number of the mishaps though and it would still have to check against our current database so if I required the difference be under a score of 4 AND its in our database could be some useful results – Jon Phenow Jul 20 '10 at 13:38
Well SSIS has some fuzzy logic tasks we use to find duplicates after the fact.
I think though you need to have your logic look at more than just the name for best results. If they are putting in address, email or phone information, perhaps you could look for people with the same last name with one or more of those other matches and ask if one of them will do. You could also make a table of nicknames for various names and match on that. You won't get all of them, but you could get some of the most common in your country at least.

- 94,695
- 15
- 113
- 186
-
yes precisely. That's why I'd like to open some project on github soon so i can add as many as I can think of and people from elsewhere can tack on their localized nickname matching – Jon Phenow Jul 20 '10 at 13:35
You can use SOUNDEX to get similar sounding names. However, it won't match with William and Bill for example.
Try this in SQL as an example.
SELECT SOUNDEX('John'), SOUNDEX('Jon')

- 100,552
- 23
- 116
- 167
-
Is this a standard SQL function? I know Oracle supports this, but do any others? – TMN Jul 20 '10 at 13:42
-
-
It's in SQL Server, which the OP tagged the question as. I don't know about others. – Robin Day Jul 20 '10 at 14:01
There is some built-in SOUNDS LIKE functionality in SQL Server, see SOUNDEX http://msdn.microsoft.com/en-us/library/aa259235%28SQL.80%29.aspx
As for full / nickname searching there isn't anything built it that I am aware of. Nicknames vary by region and it's a lot of information to keep track of. There might be a database linking full names to nicknames that you could leverage in your own application.

- 12,343
- 1
- 24
- 28