
I'm trying to validate Strings containing names so that they are appropriately capitalised.

I am using

WordUtils.capitalizeFully(name.trim(), ' ', '-', '\'');

Which is working fine for situations such as:

johN o'SmITh => John O'Smith etc

But is there a library available whereby I'm able to add Strings as delimiters in the same way I can with characters for WordUtils? For example:

Mc (like McAllister)

D' (if followed by non-vowel, like D'Souza)

And perhaps being able to completely avoid CERTAIN usages of names beginning with Mac, as

"Macintosh, and Macdonald" are suitably not camel case

yet "MacDowel" is a suitable camel-case word

and in the same right, with perhaps more use-cases

A decapitalizer which uses Strings as delimiters such as:

de/di (if no follow characters.. e.g. John de Smith)

d' (if followed by a vowel) ... YET do ensure that the vowel is capitalised e.g. John d'Agio

Right now, I'm working through a solution whereby I have String arrays of such prefixes, sorted into their appropriate categories by constants such as

final String [] CAPITALISE_FIRST_CHAR_AFTER_THIS_STRING;

followed by for-loops which iterate over the split() full name, match each word against the appropriate constant array, and apply conditional logic to replace the appropriate follow-on characters within a StringBuilder instance with capitals, or whatever the case may be. I'm simply realising that there is a LOT to get through.
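A minimal sketch of the prefix-array approach described above, not a complete solution. The class name, the two lists, and their contents are hypothetical; hyphenated names and the vowel-dependent d'/D' rules are deliberately left out.

```java
import java.util.List;
import java.util.Locale;

final class PrefixCapitaliser {
    // Hypothetical list: the letter right after these prefixes is upper-cased.
    private static final List<String> CAPITALISE_AFTER = List.of("mc", "o'");

    // Hypothetical list: particles kept lower case when they form a whole word.
    private static final List<String> KEEP_LOWER = List.of("de", "di");

    static String capitaliseWord(String word) {
        String lower = word.toLowerCase(Locale.ROOT);
        if (lower.isEmpty() || KEEP_LOWER.contains(lower)) {
            return lower;
        }
        StringBuilder sb = new StringBuilder(lower);
        // Default rule: upper-case the first character of the word.
        sb.setCharAt(0, Character.toUpperCase(sb.charAt(0)));
        // Extra rule: upper-case the character that follows a known prefix.
        for (String prefix : CAPITALISE_AFTER) {
            if (lower.startsWith(prefix) && lower.length() > prefix.length()) {
                int i = prefix.length();
                sb.setCharAt(i, Character.toUpperCase(sb.charAt(i)));
            }
        }
        return sb.toString();
    }

    static String capitaliseName(String name) {
        StringBuilder out = new StringBuilder();
        for (String word : name.trim().split("\\s+")) {
            if (out.length() > 0) {
                out.append(' ');
            }
            out.append(capitaliseWord(word));
        }
        return out.toString();
    }

    public static void main(String[] args) {
        System.out.println(capitaliseName("johN mcALLISTER"));
        System.out.println(capitaliseName("john DE smith"));
    }
}
```

Every rule here is unconditional, so a name that happens to start with "mc" but should not be camel-cased would still be changed.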

LENGTHY REQUEST my apologies, but I hope it somewhat makes sense

I looked at this Given Name Formatting and it seems ideal

But I can't actually view the link provided in the top answer; is that an issue you also face?

Olaf Kock
s.l

2 Answers


The fundamental problem here is that names don't work like that.

Rules such as mcallister -> McAllister aren't global. That is a relevant conversion that can be applied to Scottish names. It cannot be applied to Vietnamese names.

You're basically praying that no conflicts exist.

Unfortunately, conflicts DO exist. Therefore, it is not possible to 'normalize' a name unless you know the language that the name is written in, which you almost never do. Even if it is an app focused solely on citizens of Scotland, there are people living in Scotland whose names have roots in other languages.

I can give you an example of this:

In Swedish, ö is normalized to o. In German, ö is normalized to oe.

Sjögren's syndrome is also the name of a disease (named after a Swedish doctor). The name therefore shows up in all sorts of locales, even if you disregard the notion that a Swede can move to Germany and settle there, and will likely be quite miffed if you normalize their last name to Sjoegren just because they're using a German website.
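To make the conflict concrete, here is a small sketch of the two conventions side by side. The class and method names are hypothetical, and only ö/Ö is handled; `java.text.Normalizer` is the standard JDK class.

```java
import java.text.Normalizer;

final class UmlautFolding {
    // Swedish-style folding as described above: drop the diacritic (ö -> o).
    static String stripDiacritics(String s) {
        // NFD splits "ö" into "o" plus a combining mark, which \p{M} then removes.
        return Normalizer.normalize(s, Normalizer.Form.NFD).replaceAll("\\p{M}", "");
    }

    // German-style folding as described above: transliterate (ö -> oe).
    static String germanFold(String s) {
        // NFC first, so a decomposed "o" + combining mark is also recognised.
        String composed = Normalizer.normalize(s, Normalizer.Form.NFC);
        return composed.replace("\u00f6", "oe").replace("\u00d6", "Oe");
    }

    public static void main(String[] args) {
        String name = "Sj\u00f6gren"; // "Sjögren", escaped to avoid source-encoding issues
        System.out.println(stripDiacritics(name)); // Sjogren
        System.out.println(germanFold(name));      // Sjoegren
    }
}
```

The same input yields two different "normalized" forms, and nothing in the string itself says which one the person expects.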

Another related example: in Dutch, it is common policy to collate last names by disregarding the infix, which is quite common in Dutch names. Someone may be called 'Jos van Dijk', with 'Jos' being their first name and 'van Dijk' being their last name. Like Mc in Scotland, if you find a Dutch phone book, Jos would be sorted under the 'D', right before "Astrid Dijkstra" and nowhere near "Merel Valk". However, in the US, where various people whose ancestry hails from the Netherlands have kept their Dutch name, a 'Jos van Dijk' living there would find themselves in the phone book under the 'v'. Same name. Different rules.

Similar rules apply to Macintosh. Sometimes 'Macintosh' is supposed to be spelled and capitalized 'Macintosh', as with the computer series built by Apple Inc., or the cultivar of apple that it is named after. Other times it is supposed to be written 'MacIntosh', or more commonly 'McIntosh'. In fact, whilst in Scotland 'Mc' is always followed by a capital (at least for names of Scottish ancestry; it's hard to think of names that start with Mc that aren't ancestrally Scottish), with 'Mac' it goes both ways, even among Glasgow University staff surnames (see this English Stack Exchange answer).

Thus, some people like MacArthur, and others like Macarthur.

Therefore, a normalization scheme is impossible without mangling names. QED.

So, how do you solve this dilemma?

Mostly, you don't. Why do you need to know that a name is 'appropriately capitalized'? It's impossible to know this without knowing the actual person that the name is referring to, which presumably your software doesn't know about. Why is it important?

Another example: let's say you have a search-by-username feature and you'd like it to find "JoeJackson" if one types "joejackson", or even find "Müller" if someone types "Mueller" (a -very- common request in Germany).

No amount of case conversion or elimination of accents is going to allow Sjogren to equal Sjoegren, and yet that is exactly what is required if you want this system to work for last names whose origins hail from various ancestries. What you can do is search on any username that is 'close', using some appropriate near-miss search construct such as trigrams, e.g. via Postgres's pg_trgm extension.
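To illustrate the idea of a trigram-based 'close' match, here is a rough in-memory sketch. It is only loosely modelled on what pg_trgm does (lower-casing, padding, comparing sets of three-character slices) and is not its exact algorithm; the class and method names are hypothetical.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

final class TrigramSimilarity {
    // Collect all three-character slices of the lower-cased, padded input.
    static Set<String> trigrams(String s) {
        String padded = "  " + s.toLowerCase(Locale.ROOT) + " ";
        Set<String> result = new HashSet<>();
        for (int i = 0; i + 3 <= padded.length(); i++) {
            result.add(padded.substring(i, i + 3));
        }
        return result;
    }

    // Shared trigrams divided by all distinct trigrams, in the range 0.0 to 1.0.
    static double similarity(String a, String b) {
        Set<String> ta = trigrams(a);
        Set<String> tb = trigrams(b);
        Set<String> shared = new HashSet<>(ta);
        shared.retainAll(tb);
        Set<String> all = new HashSet<>(ta);
        all.addAll(tb);
        return (double) shared.size() / all.size();
    }

    public static void main(String[] args) {
        System.out.println(similarity("Sjogren", "Sjoegren"));
        System.out.println(similarity("Sjogren", "Smith"));
    }
}
```

A search would then return usernames whose score against the query exceeds some threshold; the threshold is a tuning choice, not something the names themselves dictate.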

But, rzwitserloot, that sounds so complicated!

¯\_(ツ)_/¯ Yeah, well. Names, dates, addresses, timezones, genders, flags, race designations: if humans are involved, it tends to be.

rzwitserloot
  • Thanks for the reply. I've made an exception for Mc as my only String "delimiter", since where I'm from that is -quite- a common surname among our users. An alternative (for consistency in the database) may be to just capitalise everything by default so there's no room for misinterpretation. But you're definitely right, no other solution could cater to *everyone*. One of the main reasons I'm leaving it alone is that, as said, Mac is a common prefix and you can't doctor that usage: MacIntyre, Macey, Macdonald... there's really no rule on camel casing there at all. – s.l Jul 01 '20 at 08:43
  • I'm sure there are ethnic names out there starting with Mc where a follow-up capital shouldn't be expected. It would be incorrect to say that's a shame, but it's definitely an obstacle that's hard to tackle. Forced capitalisation of all characters might be the way forward. – s.l Jul 01 '20 at 08:46

Two issues at hand

  1. Finding appropriate names through validation
  2. Capitalizing each name, first and last

Let's start with the first. WordUtils is a strong library provided by Apache Commons. However, the best way to validate names (and, I'd argue, all String objects) is by using regular expressions. Secondly, with respect to capitalizing each name, you can lower-case the whole string and then upper-case the first letter.

String name = "joHN smITh";
String[] names = name.split(" "); // first and last name stored in String[]

String lowerCaseFirst = names[0].toLowerCase(); // john
String first = lowerCaseFirst.substring(0, 1).toUpperCase() + lowerCaseFirst.substring(1); // John

String lowerCaseLast = names[1].toLowerCase(); // smith
String last = lowerCaseLast.substring(0, 1).toUpperCase() + lowerCaseLast.substring(1); // Smith

String result = first + " " + last; // John Smith
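For the validation half, a regular-expression check might look like the sketch below. The pattern and the class name are assumptions chosen for illustration: it only checks the rough shape of the input (letters separated by single spaces, apostrophes, or hyphens) and will reject some real names, for example ones containing combining marks or other punctuation.

```java
import java.util.regex.Pattern;

final class NameShapeCheck {
    // Hypothetical pattern: runs of Unicode letters joined by one space, apostrophe, or hyphen.
    private static final Pattern NAME_SHAPE =
            Pattern.compile("\\p{L}+(?:[ '\\-]\\p{L}+)*");

    static boolean looksLikeName(String s) {
        return s != null && NAME_SHAPE.matcher(s.trim()).matches();
    }

    public static void main(String[] args) {
        System.out.println(looksLikeName("John O'Smith")); // true
        System.out.println(looksLikeName("J0hn Smith"));   // false
    }
}
```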
Ari