
I'm trying to combine two databases that contain people's names. I'm running into trouble with names that differ slightly, such as someone listed as Joe in one DB and Joseph in the other. I tried googling for a library or tool for this, but could not find anything.

Is there a human name normalizer out there? I can't be the first one to tackle this sort of thing.

WeaselFox
  • Well, Joe could stand for Jonathan, Joseph or just Joe. And there are zillions of such ambiguities, even without taking into account names from different languages. How do you expect to handle this? – Eugene Sh. Nov 18 '14 at 16:58
  • @EugeneSh. - True, but there are things that can be done. Rob is always Robert... I am looking for a way around writing these rules by hand myself. – WeaselFox Nov 18 '14 at 17:00
  • @EugeneSh.: Actually, I have never heard of Joe being used as a nickname for Jonathan. – John Y Nov 18 '14 at 17:04
  • @JohnY Me too, but it was the author's example. But it doesn't really matter for the question. Rob could stand for Robin as well :) – Eugene Sh. Nov 18 '14 at 17:05
  • Sorry, fixed that. I'm not a native English speaker. – WeaselFox Nov 18 '14 at 17:08
  • Even without ambiguities, there's no guarantee that two people with the same name are the same person. The best you can do is a fuzzy correlation. – Cameron Nov 18 '14 at 17:10
  • That sounds like a common problem, but I have never heard of a magical tool able to deal with it... The worst case is typos in family names. You could use Lehvenstein distance to prefilter those, but you will still have to look at *slightly different names* manually. – Serge Ballesta Nov 18 '14 at 17:10
  • @SergeBallesta slightly different like Lehvenstein and [Levenshtein](http://en.wikipedia.org/wiki/Levenshtein_distance)? – jonrsharpe Nov 18 '14 at 17:12
  • You're definitely not the first, but (1) this is a tricky-but-not-"algorithmically-hard" problem and (2) everyone's situation is a little different. So in this space, what has wound up happening is that everyone more-or-less rolls their own. The first place to start is `difflib` in the Python standard library. Googling for "python name matching" pulls up a [few](http://stackoverflow.com/q/682367/95852) [potentially](https://github.com/derek73/python-nameparser) [useful](http://streamhacker.com/2011/10/31/fuzzy-string-matching-python/) [links](https://github.com/seatgeek/fuzzywuzzy). – John Y Nov 18 '14 at 20:21
  • Based on my own personal experience in writing this kind of stuff, I'm of the belief that ultimately, there is no good substitute for having a list of specific strings (like `Joe`) that you map to other specific strings (like `Joseph`). Yes, fuzzy matching is helpful, but I've found it's best used *in addition to* (not instead of) a custom name mapping. – John Y Nov 18 '14 at 20:32
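
To make the approach from the last two comments concrete, here is a minimal sketch that checks a hand-written nickname table first and falls back to `difflib` from the standard library for near-misses such as typos. The `NICKNAMES` table, the function names, and the `0.8` threshold are all assumptions made for illustration. A real table would be far larger and would need to be curated for your own data.

```python
import difflib

# Hypothetical nickname table: maps a lowercase short form to a canonical name.
# These three entries are illustrative only; a real list must be built by hand.
NICKNAMES = {
    "joe": "joseph",
    "rob": "robert",
    "bob": "robert",
}


def normalize(name):
    """Lowercase and trim a first name, then apply the nickname table."""
    key = name.strip().lower()
    return NICKNAMES.get(key, key)


def same_first_name(a, b, threshold=0.8):
    """Return True if two first names probably refer to the same name.

    Exact matches after normalization win outright; otherwise fall back to
    difflib's similarity ratio. The threshold is an assumed starting point
    and should be tuned against real data.
    """
    na, nb = normalize(a), normalize(b)
    if na == nb:
        return True
    return difflib.SequenceMatcher(None, na, nb).ratio() >= threshold


if __name__ == "__main__":
    for pair in [("Joe", "Joseph"), ("Jonathan", "Jonathon"), ("Joe", "Robert")]:
        print(pair, same_first_name(*pair))
```

As the comments above point out, a match on the name alone does not prove that two records are the same person, so pairs that pass only through the fuzzy fallback are best treated as candidates for manual review.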

0 Answers