The fundamental problem here is that names don't work like that.
Rules such as 'mcallister' -> 'McAllister' aren't global. That conversion is relevant to Scottish names. It cannot be applied to Vietnamese names.
You're basically praying that no conflicts exist.
Unfortunately, conflicts DO exist. Therefore, it is not possible to 'normalize' a name unless you know the language that the name is written in, which you almost never do. Even if it is an app focused solely on citizens of Scotland, there are people living in Scotland whose names have roots in other languages.
I can give you an example of this:
In Swedish, 'ö' is normalized to 'o'. In German, 'ö' is normalized to 'oe'.
Sjögren's syndrome is also the name of a disease (named after a Swedish doctor). The name therefore shows up in all sorts of locales, even if you disregard the notion that a Swede can move to Germany and settle there, and will likely be quite miffed if you normalize their last name to 'sjoegren' just because they're using a German website.
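To make the conflict concrete, here is a minimal sketch. The `FOLD` tables and the `fold` helper are hypothetical and only cover a few letters; real per-language rules are richer than this. The point is only that the same input yields two different 'normalized' forms depending on which language you assume.

```python
# Hypothetical, deliberately tiny folding tables keyed by language code.
FOLD = {
    "sv": {"\u00f6": "o", "\u00e4": "a", "\u00e5": "a"},    # ö, ä, å
    "de": {"\u00f6": "oe", "\u00e4": "ae", "\u00fc": "ue"},  # ö, ä, ü
}


def fold(name: str, language: str) -> str:
    """Lowercase a name and replace accented letters using one language's table."""
    table = str.maketrans(FOLD[language])
    return name.lower().translate(table)


name = "Sj\u00f6gren"  # Sjögren
print(fold(name, "sv"))  # sjogren
print(fold(name, "de"))  # sjoegren
```

Neither result is wrong under its own rules, and nothing in the string itself tells you which table to pick.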
Another related example: In Dutch, it is common policy to collate last names by disregarding the infix, which is quite common in Dutch names. Someone may be called 'Jos van Dijk', with 'Jos' being their first name and 'van Dijk' being their last name. As with 'Mc' in Scotland, in a Dutch phone book Jos would be sorted right before 'Astrid Dijkstra', and nowhere near 'Merel Valk'. However, in the US, where various people whose ancestry hails from the Netherlands have kept their Dutch name, a 'Jos van Dijk' living there would find themselves in the phone book under the 'v'. Same name. Different rules.
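A rough sketch of those two collation conventions, assuming a short, illustrative list of infixes (`INFIXES`, `dutch_key` and `us_key` are made-up names, and the list is nowhere near exhaustive):

```python
# Illustrative only: longer infixes come first so "van der " wins over "van ".
INFIXES = ("van der ", "van den ", "van de ", "van ", "de ", "ter ")


def dutch_key(last_name: str) -> str:
    """Sort key that disregards a leading infix, as a Dutch phone book would."""
    lowered = last_name.lower()
    for infix in INFIXES:
        if lowered.startswith(infix):
            return lowered[len(infix):]
    return lowered


def us_key(last_name: str) -> str:
    """Sort key that treats the infix as part of the name."""
    return last_name.lower()


last_names = ["Valk", "van Dijk", "Dijkstra"]
print(sorted(last_names, key=dutch_key))  # ['van Dijk', 'Dijkstra', 'Valk']
print(sorted(last_names, key=us_key))     # ['Dijkstra', 'Valk', 'van Dijk']
```

Both orderings are 'correct'; which one applies depends on where the phone book is printed, not on the name.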
Similar rules apply to Macintosh: Sometimes, 'Macintosh' is supposed to be spelled and capitalized 'Macintosh', as with the computer series built by Apple Inc. (the apple cultivar it is named after is itself spelled 'McIntosh'). Other times, it is supposed to be written 'MacIntosh', more commonly written 'McIntosh'. In fact, whilst in Scotland 'Mc' is always followed by a capital (at least for names of Scottish ancestry; it's hard to think of names that start with 'Mc' that aren't ancestrally Scottish), with 'Mac' it goes both ways, even among Glasgow University staff surnames (see this English Stack Exchange answer).
Thus, some people write their name 'MacArthur', and others 'Macarthur'.
Therefore, a normalization scheme is impossible without mangling names. QED.
So, how do you solve this dilemma?
Mostly, you don't. Why do you need to know that a name is 'appropriately capitalized'? It's impossible to know this without knowing the actual person that the name is referring to, which presumably your software doesn't know about. Why is it important?
Another example: Let's say you have a search-by-username feature and you'd like it to find "JoeJackson" if someone types "joejackson", or even find "Müller" if someone types "Mueller" (a very common request in Germany).
No amount of case conversion or elimination of accents is going to allow 'Sjogren' to equal 'Sjoegren', and yet that is exactly what is required if you want this system to work for last names whose origins hail from various ancestries. But what you can do is search on any username that is 'close', using some appropriate near-miss search construct such as trigrams, e.g. via PostgreSQL's pg_trgm extension.
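To give a feel for why trigrams help here, below is a small Python sketch of the idea. It is loosely modeled on how pg_trgm is documented to work (lowercase, pad each word with two spaces in front and one behind, compare the sets of 3-character substrings), but it is an approximation for illustration, not a reimplementation; `trigrams` and `similarity` are my own helper names.

```python
import re


def trigrams(text: str) -> set[str]:
    """Return the set of 3-character substrings of each padded, lowercased word."""
    grams: set[str] = set()
    for word in re.findall(r"\w+", text.lower()):
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a: str, b: str) -> float:
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


print(round(similarity("Sjogren", "Sjoegren"), 2))  # 0.55
print(similarity("Sjogren", "Valk"))                # 0.0
```

The two spellings never become equal, but they share most of their trigrams, so a 'close enough' threshold can match them without needing to know which language's rules produced either spelling.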
But, rzwitserloot, that sounds so complicated!
¯\_(ツ)_/¯ yeah, well. Names, dates, addresses, timezones, genders, flags, race designations - if humans are involved it tends to be.