Unicode characters in Regex

Question

I have a regular expression:

return Regex.IsMatch(_customer.FirstName, @"^[A-Za-z][A-Za-z0-9@#%&\'\-\s\.\,*]*$");

Now, some of the customers have a fada over a vowel in their surname or first name, like the following: Brendán

Note the fada over the a, which you can type by holding down Alt and Ctrl and then pressing a.

I have tried adding these characters into the regular expression but I get an error when the program tries to compile.

The only way I can allow the user to enter a character with a fada is to remove the regular expression completely, which means the user can enter anything they want.

Is there any way to use the above expression and somehow allow the following characters?

á
é
í
ó
ú

I found an important link here https://andrewwoods.net/blog/2018/name-validation-regex/ — Bagesh Sharma, Feb 04 '21 at 10:13

hwnd · Accepted Answer · 2013-12-17T22:01:14.333

Just for reference, you don't need to escape ' , or . inside a character class [], and you can avoid having to escape the dash - by placing it at the beginning or end of the character class.

You can use \p{L} which matches any kind of letter from any language. See the example below:

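using System;
using System.Text.RegularExpressions;

string[] names = { "Brendán", "Jóhn", "Jason" };
// \p{L} matches any Unicode letter, so accented letters pass too.
Regex rgx = new Regex(@"^\p{L}+$");
foreach (string name in names)
    Console.WriteLine("{0} {1} a valid name.", name, rgx.IsMatch(name) ? "is" : "is not");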

// Brendán is a valid name.
// Jóhn is a valid name.
// Jason is a valid name.

Or simply add the characters you want to allow to your character class []:

@"^[a-zA-Z0-9áéíóú@#%&',.\s-]+$"
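If you want to keep the shape of the pattern from the question (a letter first, then letters, digits and a fixed set of punctuation), the two ideas above can be combined. The following is only a sketch under that assumption: the class name NameValidator and the method IsValidName are made up for illustration, and the set of allowed punctuation is copied from the question rather than recommended.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class NameValidator
{
    // First character: any Unicode letter (\p{L}).
    // Remaining characters: Unicode letters, ASCII digits, whitespace and
    // the punctuation allowed in the question's original pattern.
    // The hyphen is placed last in the class, so it needs no escape.
    private static readonly Regex NamePattern =
        new Regex(@"^\p{L}[\p{L}0-9@#%&'.,*\s-]*$");

    public static bool IsValidName(string name)
    {
        // Treat null as invalid instead of letting Regex throw.
        return name != null && NamePattern.IsMatch(name);
    }
}
```

With this, NameValidator.IsValidName(_customer.FirstName) could replace the Regex.IsMatch call in the question. Names that start with a digit or punctuation are still rejected, as in the original pattern.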

score 9 · Answer 2 · edited Jul 08 '20 at 08:29

Try incorporating \p{L}, which matches any Unicode "letter". So both a and á should match against \p{L}.

edited by Zoe · answered Dec 17 '13 at 17:57 by AFrieze

score 5 · Answer 3 · answered Dec 17 '13 at 18:24

To expand your regular expression to include vowels with an acute accent (fada), you can use Unicode code points. You need to know which Unicode blocks the characters belong to; the fada vowels listed below all sit in the Latin-1 Supplement block.

More Unicode code charts are at http://www.unicode.org/charts/index.html#scripts, covering Latin Extended-B, -C and -D and Latin Extended Additional (which together ought to cover pretty much every European language in its entirety).

So, we see that the Irish fada vowels are:

Á is \u00C1; á is \u00E1
É is \u00C9; é is \u00E9
Í is \u00CD; í is \u00ED
Ó is \u00D3; ó is \u00F3
Ú is \u00DA; ú is \u00FA

And thus your regular expression needs to be extended:

Regex rx = new Regex( @"^[A-Za-z\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA][A-Za-z\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA0-9@#%&\'\-\s\.\,*]*$");
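Because the same ten escapes appear twice in that pattern, it may be easier to read if the pieces are named. Here is a minimal sketch of the same expression assembled from constants; FadaNameValidator and IsValid are hypothetical names chosen for this example, and the pattern is meant to match the same strings as the one above.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class FadaNameValidator
{
    // Upper- and lower-case fada vowels as \u escapes, so the source file
    // stays ASCII-only: Á É Í Ó Ú á é í ó ú.
    // In a verbatim string the backslash reaches the regex engine, which
    // interprets \uXXXX itself.
    private const string Fada =
        @"\u00C1\u00C9\u00CD\u00D3\u00DA\u00E1\u00E9\u00ED\u00F3\u00FA";

    // First character: an ASCII letter or a fada vowel.
    private const string First = "[A-Za-z" + Fada + "]";

    // Later characters: the same letters plus digits, whitespace and the
    // punctuation from the question. The hyphen is last, so it is unescaped.
    private const string Rest = "[A-Za-z" + Fada + @"0-9@#%&'.,*\s-]";

    private static readonly Regex Pattern =
        new Regex("^" + First + Rest + "*$");

    public static bool IsValid(string name)
    {
        return name != null && Pattern.IsMatch(name);
    }
}
```

Note that this approach only accepts the characters that are listed explicitly: a name containing è or ë would still be rejected until those code points are added.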

score 1 · Answer 4 · answered Dec 17 '13 at 17:58

\w (word characters) includes Unicode characters.

So your expression could be:

@"^\w[\w0-9@#%&\'\-\s\.\,*]*$"

(Replacing A-Za-z with \w)

answered by driis

I thought the same thing, but it doesn't actually work as I expected either. http://regex101.com/r/pG5kS5 – Mike Perrenoud Dec 17 '13 at 18:05
The problem with the word character class (`\w`) is that it matches a lot of stuff: Unicode letters — categories `Ll` (lower-case), `Lu` (upper-case), `Lt` (title case), `Lo` (letter, other), `Lm` (letter, modifier), `Nd` (number, decimal digit...which includes more than just ASCII 0-9) and `Pc` (punctuation, connector). – Nicholas Carey Dec 17 '13 at 18:06
@MikePerrenoud There's no guarantee that PHP's regex library matches the behavior of C#'s, even if they're both PCRE. You can see from that link that the Python regex engine matches differently. – jpaugh Jul 14 '20 at 21:48
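To make the difference between \w and \p{L} concrete, here is a small sketch that runs both against the same inputs with .NET's default regex options. WordClassDemo and its two methods are names invented for this illustration.

```csharp
using System.Text.RegularExpressions;

// Hypothetical helper, not part of any library.
public static class WordClassDemo
{
    // True when every character is a "word character": letters, decimal
    // digits, connector punctuation such as the underscore, and so on.
    public static bool WordOnly(string s)
    {
        return Regex.IsMatch(s, @"^\w+$");
    }

    // True when every character is a Unicode letter.
    public static bool LettersOnly(string s)
    {
        return Regex.IsMatch(s, @"^\p{L}+$");
    }
}
```

Both methods accept "Brendán", but only WordOnly accepts inputs such as "_42" or a string of Arabic-Indic digits. If the first character of a name must be a letter, \p{L} is therefore the stricter choice.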

score 0 · Answer 5 · answered Dec 17 '13 at 18:09

Try the following, which adds the accented characters directly to the character class:

return Regex.IsMatch(_customer.FirstName, @"^[0-9A-Za-z@#%&\'\-\s\.\,ñáéíóúü]+$");

answered by Pandian
