Foreign language characters in Regular expression in C#

Question

In C# code, I am trying to validate a string that contains Chinese characters: "中文ABC123".

When I use a general alphanumeric pattern, "^[a-zA-Z0-9\s]+$", the string "中文ABC123" does not match and regex validation fails.

What do I need to add to the expression in C#?

Andie2302 · Accepted Answer · 2015-01-26T21:40:35.777

44

To match any letter character from any language use:

\p{L}

If you also want to match numbers:

[\p{L}\p{Nd}]+

\p{L} ... matches a character of the Unicode category Letter.
                It is the short form for [\p{Ll}\p{Lu}\p{Lt}\p{Lm}\p{Lo}]
                  \p{Ll} ... matches lowercase letters. (abc)
                  \p{Lu} ... matches uppercase letters. (ABC)
                  \p{Lt} ... matches titlecase letters.
                  \p{Lm} ... matches modifier letters.
                  \p{Lo} ... matches letters without case. (中文)

\p{Nd} ... matches a character of the Unicode category Decimal Digit.
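
To see which category a given character falls into, a small sketch like the following may help. It assumes a modern .NET project with top-level statements; the sample characters are only illustrative.

```csharp
using System;
using System.Globalization;

// Print the Unicode general category that .NET reports for a few sample characters.
// The enum names map onto the regex categories: LowercaseLetter = Ll,
// UppercaseLetter = Lu, OtherLetter = Lo, DecimalDigitNumber = Nd.
foreach (char c in "aA中1")
{
    UnicodeCategory category = char.GetUnicodeCategory(c);
    Console.WriteLine($"U+{(int)c:X4} -> {category}");
}
```

The Chinese character is reported as OtherLetter, which is why \p{Lo} (and therefore \p{L}) covers it.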

Just replace: ^[a-zA-Z0-9\s]+$ with ^[\p{L}0-9\s]+$
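
As a quick check of that replacement, here is a minimal sketch comparing the two patterns on the string from the question. The variable names are made up for illustration, and it assumes top-level statements.

```csharp
using System;
using System.Text.RegularExpressions;

// Original ASCII-only pattern from the question.
var asciiOnly = new Regex(@"^[a-zA-Z0-9\s]+$");
// Suggested replacement: any Unicode letter, ASCII digits, and whitespace.
var anyLetter = new Regex(@"^[\p{L}0-9\s]+$");

string input = "中文ABC123";

Console.WriteLine(asciiOnly.IsMatch(input)); // False
Console.WriteLine(anyLetter.IsMatch(input)); // True
```

If digits from other scripts should pass as well, swapping 0-9 for \p{Nd} in the class would be the natural variation.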

edited Jan 26 '15 at 21:40

answered Jan 26 '15 at 18:55


Or, if punctuation is OK, the simpler `\w` ([word character](https://msdn.microsoft.com/en-us/library/20bw873z.aspx#WordCharacter)) can be used instead of `[\p{L}0-9]`. – bzlm Jan 26 '15 at 19:33

By the way Andie2302, this conflicts badly with the HTML5 pattern attribute: I tried this expression in an HTML5 pattern attribute and it failed to validate. Do you have any idea how to make it work with the HTML5 pattern attribute for all languages? – user2683269 Jan 26 '15 at 20:57

@user2683269 JavaScript (and hence html5 input patterns) doesn't support `\p`, and treats `\w` as "latin word character", so it's trickier there: http://stackoverflow.com/a/22075070/7724 – bzlm Jan 26 '15 at 21:17

Besides Chinese and Japanese characters, what other languages might `\p{Lo}` capture? – Yoav Feuerstein Oct 18 '17 at 15:06

@bzlm a bit more info on `\w` in .NET: https://stackoverflow.com/a/2998550/2246411 (note that `\w` does not work for all languages if using ECMAScript-compliant behavior) – derekantrican May 19 '19 at 16:26
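
To illustrate the `\w` point, here is a small sketch assuming the documented .NET behavior: by default `\w` covers Unicode letters, digits, and connector punctuation such as the underscore, while `RegexOptions.ECMAScript` narrows it to `[a-zA-Z_0-9]`. The sample string is made up.

```csharp
using System;
using System.Text.RegularExpressions;

string input = "中文_ABC123";

// Default .NET behavior: \w includes Unicode letters, digits, and the underscore.
bool unicodeWord = Regex.IsMatch(input, @"^\w+$");
// ECMAScript-compliant behavior: \w is limited to [a-zA-Z_0-9].
bool ecmaWord = Regex.IsMatch(input, @"^\w+$", RegexOptions.ECMAScript);

Console.WriteLine(unicodeWord); // True
Console.WriteLine(ecmaWord);    // False
```
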

String: IŠMIN-AS-AK-AŠ/20 Pattern: "/IŠMIN-AS-AK-\p{L}{2,}/" Result: ^ b"IÅ MIN-AS-AK-AÅ" How can I solve this? – keizah7 Aug 17 '22 at 05:59

score 3 · Answer 2 · answered Jun 14 '19 at 18:55

Thanks to @Andie2302 for pointing out the right way to do it.

In addition, many languages use combining characters that need a base character to attach to. For example, if you match the Thai word 'เก็บ' with \p{L} only, you get just 'เกบ': one of the marks is missing from the word.

That is why \p{L} alone does not work for every language.

So, to support almost every language, put both categories in the character class:

[\p{L}\p{M}]

NOTE:

L stands for 'Letter' (all letters from all languages, not including marks)

M stands for 'Mark' (a mark cannot be displayed on its own; it needs a letter to attach to)
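
The Thai example can be checked with a short sketch like this one. The variable names are hypothetical, and it assumes top-level statements and that the source file is saved as UTF-8.

```csharp
using System;
using System.Text.RegularExpressions;

// Four code points: U+0E40, U+0E01, U+0E47 (a combining mark), U+0E1A.
string thai = "เก็บ";

// Letters only: fails because of the combining mark.
bool lettersOnly = Regex.IsMatch(thai, @"^\p{L}+$");
// Letters and marks together: the whole word is accepted.
bool lettersAndMarks = Regex.IsMatch(thai, @"^[\p{L}\p{M}]+$");
// Removing everything that is not a letter drops the mark.
string stripped = Regex.Replace(thai, @"[^\p{L}]", "");

Console.WriteLine(lettersOnly);     // False
Console.WriteLine(lettersAndMarks); // True
Console.WriteLine(stripped.Length); // 3
```

Note that the two categories sit inside one character class here, so letters and marks may appear in any order; writing \p{L}\p{M} without brackets would instead require a letter followed by exactly one mark.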

If you also need numbers, add:

\p{N}

NOTE:

N stands for 'Number' (any kind of numeric character, not only decimal digits)
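
To show the difference between \p{N} and the narrower \p{Nd}, and how the pieces might be combined, here is a rough sketch. The combined pattern and the `validator` name are my own assumption about how one could put the categories together, not something either answer spells out.

```csharp
using System;
using System.Text.RegularExpressions;

// Roman numeral four (U+2163) is a letter number (Nl), not a decimal digit (Nd).
string roman = "\u2163";

Console.WriteLine(Regex.IsMatch("123", @"^\p{Nd}+$")); // True
Console.WriteLine(Regex.IsMatch(roman, @"^\p{Nd}+$")); // False
Console.WriteLine(Regex.IsMatch(roman, @"^\p{N}+$"));  // True

// Hypothetical combined validator: letters, marks, numbers, and whitespace.
var validator = new Regex(@"^[\p{L}\p{M}\p{N}\s]+$");
Console.WriteLine(validator.IsMatch("เก็บ 中文ABC123")); // True
```

Whether \p{N} or \p{Nd} is the better choice depends on whether characters such as Roman numerals and fractions should count as valid input.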

Thanks to this website for the very useful information:

https://www.regular-expressions.info/unicode.html
