To learn Regex, I have been solving practice problems, and this is one of them. I know Regex might not be the best tool for it, and my Regex is a mess, but I liked the challenge.

Problem:

  • Names need to be in Title Case;
  • There are exceptions for some lowercase words inside the name;
  • There are also exceptions for some names, e.g.: McDonald, MacDuff, D'Estoile
  • Names with ' and - are accepted, and sometimes they are o'Brien, O'brien, O'Brien, O' Brien or 'Ehu Kali.
  • No whitespace at the beginning or end of the name;
  • No more than one space between the names in a full name;
  • A . is accepted if it is not alone, e.g.: Dan . Ferdnand (isn't accepted) and Dan G. Ferdnand (is accepted)
  • Numbers and symbols are not accepted
  • However, Roman numerals are accepted and aren't Title Case, e.g.: Elizabeth II
  • Some names can be alone, e.g.: Akihito (Prince of Japan)
  • Some special characters common in some countries are accepted, e.g.: Valeh ßlÿsgÿroğlu, Lażżru Role, Alaksiej Taraškievič
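
To make the simpler rules concrete, here is a minimal sketch in Python of the whitespace, digit, symbol and lone-dot checks only. The function name and messages are hypothetical, and it deliberately ignores Title Case and the lowercase-word exceptions, which are the hard part:

```python
import unicodedata

# Punctuation that the rules above explicitly allow inside a name
ALLOWED_PUNCT = {"'", "-", "."}

def basic_name_checks(name):
    # Hypothetical helper: returns the list of violated rules (empty = OK).
    problems = []
    if name != name.strip():
        problems.append("leading or trailing whitespace")
    if "  " in name:
        problems.append("more than one space between names")
    if any(unicodedata.category(ch) == "Nd" for ch in name):
        problems.append("contains a digit")
    # Unicode categories starting with P (punctuation) or S (symbol)
    if any(unicodedata.category(ch)[0] in "PS" and ch not in ALLOWED_PUNCT
           for ch in name):
        problems.append("contains a symbol")
    # A "." that forms a whole space-separated part is a lone dot
    if "." in name.split(" "):
        problems.append("lone '.'")
    return problems

for sample in ["Dan G. Ferdnand", "Dan . Ferdnand", "Maria  Silva", "John W8"]:
    print(repr(sample), basic_name_checks(sample))
```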

Regex

The code is

^(?![ ])(?!.*(?:\d|[ ]{2}|[!$%^&*()_+|~=`\{\}\[\]:";<>?,\/]))(?:(?:e|da|do|das|dos|de|d'|la|las|el|los|l'|al|of|the|el-|al-|di|van|der|op|den|ter|te|ten|ben|ibn)\s*?|(?:[A-ZàáâäãåąčćęèéêëėįìíîïłńòóôöõøùúûüųūÿýżźñçčšžÀÁÂÄÃÅĄĆČĖĘÈÉÊËÌÍÎÏĮŁŃÒÓÔÖÕØÙÚÛÜŲŪŸÝŻŹÑßÇŒÆČŠŽ∂ð'][^\s]*\s*?)(?!.*[ ]$))+$

And the Regex101 with a validation list

References

What I tried so far was based on these:

Not working

I wrote this Regex and don't know how to stop it from matching the cases below, which currently match:

  • CAPITAL LETTER
  • AlTeRnAtE LeTtEr

And these don't match but should:

  • Urxan Əbűlhəsənzadə
  • İsmət Jafarov
  • Şükür Hagverdiyev
  • Űmid Abdurrahimov
  • Ġerardo Seralta
  • Ċikku Paris
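
Both symptoms can be reproduced with a reduced pattern, sketched here in Python. The character class is shortened for readability (that shortening is an assumption; the full one is in the Regex above), but the structure is the same: one starting character from an explicit list, followed by any run of non-whitespace.

```python
import re

# Reduced stand-in for the name-part branch of the Regex above.
# The explicit list of starting characters is shortened here (assumption).
NAME_PART = re.compile(r"[A-ZÀÁÂÄÃÅ'][^\s]*")

def matches_part(word):
    # True if the whole word is accepted by the reduced pattern
    return NAME_PART.fullmatch(word) is not None

# [^\s]* puts no restriction on case after the first character
print(matches_part("CAPITAL"))
print(matches_part("AlTeRnAtE"))

# İ (U+0130) is not in the explicit list, so the word is rejected
print(matches_part("İsmət"))
```

So the first two go through because nothing constrains the characters after the first one, and the last one fails because its first letter is simply not listed.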

Question

Is there a way to optimize this Regex (monster)?

And how do I fix the problems listed under Not working?

P.S.: The list of example names for validation can be found at the Regex101 link.

danieltakeshi
  • Thank you for taking the time to write a well-written post. You are helping restore my faith in properly written and formatted questions! – ctwheels Oct 06 '17 at 14:42
  • Hard to validate names with 1 regex and cover all languages. `No more than one space inside Name;` so my name is not accepted? :'( ... anyway, maybe if you are practicing try out https://www.debuggex.com/, it helped me a lot. – Ron van der Heijden Oct 06 '17 at 14:57
  • I will edit/rephrase, it is no more than two spaces between each name – danieltakeshi Oct 06 '17 at 14:58
  • Have you seen [Javascript + Unicode regexes](https://stackoverflow.com/questions/280712/javascript-unicode-regexes)? – Shakiba Moshiri Oct 06 '17 at 15:08
  • Note that there are differences between how languages and tools implement regular expressions (often stated as different regex "flavours"). Have you decided on a flavour to use (your regex101 link would suggest JavaScript's)? If so, please tell us which one, as it might lead to different answers, especially on the topic of non-ASCII character handling. – Aaron Oct 06 '17 at 15:09
  • I was using Excel to implement Regex, because I have experience with it. It could be either C++ or Excel, since I have experience with both. However, the flavour can be Java; I am just learning, and it would be nice to learn more about Java, as I only know the basics of web Java. – danieltakeshi Oct 06 '17 at 16:24

1 Answer

Brief

Since you're learning Regex and haven't specified a flavour to use, I've chosen PCRE, as it is widely supported in the regex world.

Code

See this regex in use here

(?(DEFINE)
    (?# Definitions )
    (?<valid_nameChars>[\p{L}\p{Nl}])
    (?<valid_nonNameChars>[^\p{L}\p{Nl}\p{Zs}])
    (?<valid_startFirstName>(?![a-z])[\p{L}'])
    (?<valid_upperChar>(?![a-z])\p{L})
    (?<valid_nameSeparatorsSoft>[\p{Pd}'])
    (?<valid_nameSeparatorsHard>\p{Zs})
    (?<valid_nameSeparators>(?&valid_nameSeparatorsSoft)|(?&valid_nameSeparatorsHard))
    (?# Invalid combinations )
    (?<invalid_startChar>^[\p{Zs}a-z])
    (?<invalid_endChar>.*[^\p{L}\p{Nl}.\p{C}]$)
    (?<invalid_unaccompaniedSymbol>.*(?&valid_nameSeparatorsHard)(?&valid_nonNameChars)(?&valid_nameSeparatorsHard))
    (?<invalid_overTwoUpper>(?:(?&valid_nameChars)*\p{Lu}){3})
    (?<invalid>(?&invalid_startChar)|(?&invalid_endChar)|(?&invalid_unaccompaniedSymbol)|(?&invalid_overTwoUpper))
    (?# Valid combinations )
    (?<valid_name>(?:(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*(?&valid_nameChars)+(?:(?&valid_nameChars)|(?&valid_nameSeparatorsSoft))*)+\.?)
    (?<valid_firstName>(?&valid_startFirstName)(?:\.|(?&valid_name)*))
    (?<valid_multipleName>(?&valid_firstName)(?=.*(?&valid_nameSeparators)(?&valid_upperChar))(?:(?&valid_nameSeparatorsHard)(?&valid_name))+)
    (?<valid>(?&valid_multipleName)|(?&valid_firstName))
)
^(?!(?&invalid))(?&valid)$
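
To make one of the definitions concrete: invalid_overTwoUpper is the part that rejects CAPITAL LETTER and AlTeRnAtE LeTtEr. Below is a rough Python sketch of the same check, assuming the standard unicodedata module as a stand-in for \p{L}, \p{Nl} and \p{Lu}. The function name is hypothetical and not part of the regex.

```python
import unicodedata

def over_two_upper(name):
    # Hypothetical helper mirroring (?:(?&valid_nameChars)*\p{Lu}){3}
    # when it is tried at the start of the string.
    upper_count = 0
    for ch in name:
        category = unicodedata.category(ch)
        # valid_nameChars is [\p{L}\p{Nl}]; stop at the first other character
        if not (category.startswith("L") or category == "Nl"):
            break
        if category == "Lu":
            upper_count += 1
    return upper_count >= 3

for sample in ["CAPITAL LETTER", "AlTeRnAtE LeTtEr",
               "Pope Benedict XVI", "Aindreas MacIllEathain"]:
    print(sample, over_two_upper(sample))
```

Because the negative lookahead is anchored at the start of the string, only the first run of letters is inspected, which is why names such as Aindreas MacIllEathain still pass.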

Results

Input

== 1NcOrrect N4M3S ==
CAPITAL LETTER
AlTeRnAtE LeTtEr
Natalia maria
Natalia aria
Natalia orea
Maria dornelas
Samuel eto'
Miguel lasagna
Antony1 de Home Ap*ril
Ap*ril Willians
Antony_ de Home Apr+il
Ant_ony de Home Apr#il
Antony@ de Ho@me Apr^il
Maria  Silva
Maria silva
maria Silva
 Maria Silva
Maria Silva 
Maria / Silva
Maria . Silva
John W8

==Correct Names==
Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai

==EXTRA== only if possible, strange ones
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman

Output

Note: Shown below are only the strings that matched from the above Input

Urxan Əbűlhəsənzadə
İsmət Jafarov
Şükür Hagverdiyev
Űmid Abdurrahimov
Ġerardo Seralta
Ċikku Paris
Hind ibn Sheik
Colop-U-Uichikin
Lażżru Role
Alaksiej Taraškievič
Petruso Husoǔski
Sumu-la-El
Valeh ßlÿsgÿroğlu
'Arab al-Rashayida
Tariq al-Hashimi
Nabeeh el-Mady
Tariq Al-Hashimi
Brian O'Conner
Maria da Silva
Maria Silva
Maria G. Silva
Maria McDuffy
Getúlio Dornelles Vargas
Maria das Flores
John Smith
John D'Largy
John Doe-Smith
John Doe Smith
Hector Sausage-Hausen
Mathias d'Arras
Martin Luther King Jr.
Ai Wong
Chao Chang
Alzbeta Bara
Marcos Assunção
Maria da Silva e Silva
Juscelino Kubitschek de Oliveira
Maria da Costa e Silva
Samuel Eto'o
María Antonieta de las Nieves
Eugène
Antòny de Homé April
àntony de Home ùpril
Antony de Home Aprìl
Pierre de l'Estache
Pierre de L'Estoile
Akihito
Nadine Schröder
Anna A. Møller
D. Pedro I
Pope Benedict XVI
Marsibil Ragnarsdóttir
Natanaël Morel
Isaac De la Croix
Jean-Michel Bozonnet
Qutaibah Mu'tazz Abadi
Rushd Jawna' Kassab
Khaldun Abdul-Qahhar Sabbag
'Awad Bashshar Asker
Al B. Zellweger
Gunnleif Snæ-Ulfsson
Käre Toresson
Sorli Ærnmundsson
Arnkel Øystæinsson
Ástríður Dórey
Åsmund Kåresson
Yahatti-Il
Ipqu-Annunitum
Nabu-zar-adan
Eskopas Cañaverri
Botolph of Langchester
Aelfhun the Cantrell
Fraco di Natale
Fraco Di Natale
Iván de Luca
Iván De Luca
Man'nah
Atabala Aüamusalü
Ramiz Ağasəfalu
Dadaş Aghakhanov
Fÿrxad Mübarizlı
Vaclaǔ Šupa
Yakiv Volacič
Flor Van Vaerenbergh
Flor van Vaerenbergh
Edwin van der Sar
Husein Ekmečić
Álvaro Guimarães Alencar
Phone U Yaza Arkar
Seocan MacGhille
X'wat'e Tlekadugovy
Albert-Jan Bootsveld
Maurits-jan Kuipers op den Kollenstaart
Elco ter Hoek
Robbert te Poele
Aad ten Have
'Ehu Kali
Ho'opa'a Loni
Aukanai'i Mahi'ai
Kalman ben Tal El
Żytomir Roszkowski
K'awai
Maol-Moire Mac'IlleBhuidh
Tòmas MacIlleChruim
Aindreas MacIllEathain
Eanruig MacGilleBhreac
Peadar MacGilleDhonaghart
Maolmhuire MacGill-Eain
Eanruig MacGilleBhreac
Wim van 't Plasman

Explanation

I used a DEFINE block to create named definitions; you can look at each definition to see how it works. In general, I use \p{.} where . is replaced by the name of a Unicode character category (e.g. \p{L} is any letter from any language). This is not supported in every flavour of regex, but where it is available it makes the regex much simpler, which is why I used it.
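
If your flavour lacks \p{.} (Python's built-in re module is one example), the categories can still be inspected directly. Here is a small sketch, assuming Python's standard unicodedata module; the helper names are made up for illustration:

```python
import unicodedata

def is_letter(ch):
    # Rough equivalent of \p{L}: general category starts with "L"
    return unicodedata.category(ch).startswith("L")

def is_upper_letter(ch):
    # Rough equivalent of \p{Lu}
    return unicodedata.category(ch) == "Lu"

# First letters of the names that the original character class rejected
for ch in "ƏİŞŰĠĊ":
    print(ch, unicodedata.category(ch), is_upper_letter(ch))

# Other categories used in the definitions
print(unicodedata.category(" "))  # Zs, space separator
print(unicodedata.category("-"))  # Pd, dash punctuation
print(unicodedata.category("8"))  # Nd, decimal digit (not a letter)
```

This is why a category-based class accepts Əbűlhəsənzadə or İsmət without listing every accented letter by hand.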

If you need anything else explained, don't hesitate to ask me and I'll do my best, but regex101 should be able to explain anything you're wondering about regex.

ctwheels
  • Nice! worked on Regex101, gonna check more later. Do you know if Excel Regex 5.5 accepts unicode? `\p{L}\p{Nl}` – danieltakeshi Oct 06 '17 at 21:00
  • @danieltakeshi I don't think it'll accept any of this regular expression, unfortunately. It is, after all, Microsoft haha (they like to reinvent the wheel and make sure the wheel is square with rounded corners so that it's "different" - works, but ineffective). But you can break it down and build the regex from what I presented (it works closely to how code would in terms of how it's constructed, so, assuming you have a background in coding, you should be able to manipulate it properly). I presented it the way I did so that it can easily be understood (rather than a long *idkwhatitsays* regex) – ctwheels Oct 06 '17 at 21:28
  • Also, as a side note, it's a bad idea to validate names. I know this is a personal project, but never validate names (you can validate that, for example, there's at least 1 character and it's not invisible, but that's about it). See [how can i validate a name middle name and last name using regex in java](https://stackoverflow.com/questions/672855/how-can-i-validate-a-name-middle-name-and-last-name-using-regex-in-java?rq=1) and [personal names in a global application what to store](https://stackoverflow.com/questions/620118/personal-names-in-a-global-application-what-to-store) for more info. – ctwheels Oct 06 '17 at 21:42