How to regex Chinese characters in C#?

Question

I am trying to use a regex in C# to match Chinese characters.

\p{Han}+

However, C# fails at run time with the error "Unknown property 'Han'".

Are you using string interpolation? If so, you need to [escape the curly braces](https://stackoverflow.com/questions/31333096/how-to-use-escape-characters-with-string-interpolation-in-c-sharp-6). — Kenneth K., Jan 18 '19 at 17:53
Take a look at: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories — Flydog57, Jan 18 '19 at 17:56
There are some CJK related named blocks in [that list](https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories). No 'IsHan' though. — LukStorms, Jan 18 '19 at 18:16
See https://stackoverflow.com/a/11817023/3832970. One remark: even if you use Unicode property classes, astral chars won't get matched. You need to match surrogate pairs and their ranges for these chars. — Wiktor Stribiżew, Jan 18 '19 at 18:19
But that docs.microsoft link also says: _"You can determine the Unicode category of any particular character by passing that character to the `GetUnicodeCategory` method."_ Maybe you could try a selection of characters and see what that returns — Flydog57, Jan 18 '19 at 19:13
Based on [this post](https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode) it seems that you'd have to be Chinese to understand which Unicode ranges are needed... — LukStorms, Jan 19 '19 at 08:57

score 0 · Answer 1 · edited Jan 31 '20 at 05:27

In theory, a Unicode script property such as \p{Han} in the regular expression would meet this requirement.

However, the C# (.NET) regex engine doesn't support Unicode scripts; it supports Unicode general categories and named blocks.

The Regex constructor throws an ArgumentException like this:

[System.ArgumentException: parsing "\p{Han}+" - Unknown property 'Han'.]

at System.Text.RegularExpressions.RegexCharClass.SetFromProperty(String capname, Boolean invert, String pattern)
at System.Text.RegularExpressions.RegexCharClass.AddCategoryFromName(String categoryName, Boolean invert, Boolean caseInsensitive, String pattern)
at System.Text.RegularExpressions.RegexParser.ScanBackslash()
at System.Text.RegularExpressions.RegexParser.ScanRegex()
at System.Text.RegularExpressions.RegexParser.Parse(String re, RegexOptions op)
at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, Boolean useCache)
at System.Text.RegularExpressions.Regex..ctor(String pattern)

Details are referenced here.
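
To illustrate the difference between scripts and categories, here is a minimal sketch, assuming a plain console app. The sample string is made up, and `\p{Lo}` ("Letter, other") is used only as an example of a supported general category: it is much broader than Han and also matches kana, Hangul, and letters from many other scripts.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        try
        {
            // Script name: the .NET regex parser rejects it at construction time.
            new Regex(@"\p{Han}+");
        }
        catch (ArgumentException ex)
        {
            // The message reports the unknown property 'Han'.
            Console.WriteLine(ex.Message);
        }

        // General category: accepted by the parser.
        // Note: Lo is NOT limited to Chinese characters.
        var regex = new Regex(@"\p{Lo}+");
        Console.WriteLine(regex.IsMatch("汉字"));
    }
}
```

Because the exception is raised while the pattern is parsed, no input string is needed to reproduce it.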

score 0 · Answer 2 · answered Jun 18 '21 at 08:46

In .NET, you need to prepend Is to Unicode block names.

I don't know what the corresponding block is for Han, or whether one is supported (the documented list of named blocks has CJK entries but no `IsHan`), but the syntax would be:

\p{IsHan}+

See MSDN for a list of supported types.

This works for other alphabets. See an example for Greek and Latin.

score 0 · Answer 3 · answered Apr 07 '22 at 02:38

On the .NET platform, this regex matches Chinese characters:

\p{IsCJKUnifiedIdeographs}+

https://en.wikipedia.org/wiki/CJK_Unified_Ideographs

https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#supported-named-blocks
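
A minimal sketch of how this block name might be used, assuming a console app; the input string is hypothetical. The CJK Unified Ideographs block covers U+4E00 to U+9FFF only, so ideographs in the extension blocks (including those outside the Basic Multilingual Plane) are not matched by this pattern.

```csharp
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        // Named block for CJK Unified Ideographs (U+4E00..U+9FFF).
        var regex = new Regex(@"\p{IsCJKUnifiedIdeographs}+");

        // Hypothetical sample input mixing Latin and Chinese text.
        string input = "Hello 你好 world 世界";

        // Each match is one contiguous run of characters from the block.
        foreach (Match match in regex.Matches(input))
        {
            Console.WriteLine(match.Value);
        }
    }
}
```

With this input, the two matches are 你好 and 世界; the Latin words and spaces are skipped.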

score -2 · Answer 4 · answered Jan 18 '19 at 17:58

This might work:

\p{L}

That allows letters from any alphabet. If you want only Chinese characters (no English ones), I may need more time.

Also, I am assuming you are using Regex correctly. Test this code with \p{Han}+ to see whether it still fails.

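        // Requires: using System; using System.Text.RegularExpressions;
        // The pattern from the question. On .NET the constructor is expected
        // to throw ArgumentException here (Unknown property 'Han').
        Regex regex = new Regex(@"\p{Han}+");
        Match match = regex.Match("YourString");
        if (match.Success)
        {
            Console.WriteLine("MATCH VALUE: " + match.Value);
        }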
