-1

I am trying to use a regex in C# to match Chinese characters.

\p{Han}+

However, the regex fails at runtime with the error "Unknown property 'Han'".

And Wan
  • Are you using string interpolation? If so, you need to [escape the curly braces](https://stackoverflow.com/questions/31333096/how-to-use-escape-characters-with-string-interpolation-in-c-sharp-6). – Kenneth K. Jan 18 '19 at 17:53
  • 1
    Take a look at: https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories – Flydog57 Jan 18 '19 at 17:56
  • 1
    There are some CJK related named blocks in [that list](https://learn.microsoft.com/en-us/dotnet/standard/base-types/character-classes-in-regular-expressions#SupportedUnicodeGeneralCategories). No 'IsHan' though. – LukStorms Jan 18 '19 at 18:16
  • See https://stackoverflow.com/a/11817023/3832970. One remark: even if you use Unicode property classes, astral chars won't get matched. You need to match surrogate pairs and their ranges for these chars. – Wiktor Stribiżew Jan 18 '19 at 18:19
  • But that docs.microsoft link also says: _"You can determine the Unicode category of any particular character by passing that character to the `GetUnicodeCategory` method."_ Maybe you could try a selection of characters and see what that returns – Flydog57 Jan 18 '19 at 19:13
  • Based on [this post](https://stackoverflow.com/questions/1366068/whats-the-complete-range-for-chinese-characters-in-unicode) it seems that you'd have to be Chinese to understand which Unicode ranges are needed... – LukStorms Jan 19 '19 at 08:57

4 Answers

0

In theory, a Unicode script property in the regular expression, such as \p{Han}, would do this.

However, .NET regular expressions don't support Unicode scripts (Unicode general categories are fine).

The pattern throws an ArgumentException like this:

[System.ArgumentException: parsing "\p{Han}+" - Unknown property 'Han'.]

at System.Text.RegularExpressions.RegexCharClass.SetFromProperty(String capname, Boolean invert, String pattern)
at System.Text.RegularExpressions.RegexCharClass.AddCategoryFromName(String categoryName, Boolean invert, Boolean caseInsensitive, String pattern)
at System.Text.RegularExpressions.RegexParser.ScanBackslash()
at System.Text.RegularExpressions.RegexParser.ScanRegex()
at System.Text.RegularExpressions.RegexParser.Parse(String re, RegexOptions op)
at System.Text.RegularExpressions.Regex..ctor(String pattern, RegexOptions options, TimeSpan matchTimeout, Boolean useCache)
at System.Text.RegularExpressions.Regex..ctor(String pattern)

Details are referenced here.
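
To make the difference concrete, here is a minimal sketch that probes which `\p{...}` names the .NET parser accepts. `PatternCheck` and `IsSupported` are hypothetical names introduced only for this illustration. It assumes nothing beyond `System.Text.RegularExpressions`.

```csharp
using System;
using System.Text.RegularExpressions;

static class PatternCheck
{
    // Hypothetical helper: returns true if the .NET regex parser accepts the pattern.
    public static bool IsSupported(string pattern)
    {
        try
        {
            new Regex(pattern);
            return true;
        }
        catch (ArgumentException)
        {
            // Raised for "Unknown property" and other pattern parse errors.
            return false;
        }
    }

    static void Main()
    {
        Console.WriteLine(IsSupported(@"\p{Han}+"));                    // Unicode script name
        Console.WriteLine(IsSupported(@"\p{Lo}+"));                     // general category (Other Letter)
        Console.WriteLine(IsSupported(@"\p{IsCJKUnifiedIdeographs}+")); // named block
    }
}
```

This prints `False`, `True`, `True`: the script name is rejected, while the general category and the named block both parse.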

marc_s
TomCW
0

In .NET, you need to prepend Is to Unicode block names.

I don't know what the corresponding block is for Han, or if it's supported, but you can try:

\p{IsHan}+

See MSDN for the list of supported named blocks.

This works for other alphabets. See an example for Greek and Latin.
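
As a commenter on the question points out, the supported list has CJK-related named blocks but no `IsHan`. A rough sketch using those blocks instead might look like the following. The choice of these three blocks is an assumption about what counts as "Chinese characters" here, and `CjkBlocks` and `FirstRun` are hypothetical names.

```csharp
using System;
using System.Text.RegularExpressions;

static class CjkBlocks
{
    // Assumption: these three BMP blocks are sufficient for the use case.
    // Characters outside the BMP (e.g. Extension B) are NOT covered and
    // would need explicit surrogate-pair ranges.
    const string Pattern =
        @"[\p{IsCJKUnifiedIdeographs}" +
        @"\p{IsCJKUnifiedIdeographsExtensionA}" +
        @"\p{IsCJKCompatibilityIdeographs}]+";

    // Returns the first run of CJK ideographs, or null if there is none.
    public static string FirstRun(string input)
    {
        Match m = Regex.Match(input, Pattern);
        return m.Success ? m.Value : null;
    }

    static void Main()
    {
        Console.WriteLine(FirstRun("Hello 漢字 world"));
    }
}
```

Note that these blocks are shared with Japanese kanji and Korean hanja, so the pattern cannot tell Chinese text apart from those.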

alelom
-2

This might work:

\p{L}

That would allow letters from any alphabet. If you want only Chinese characters (no English ones), then I may need more time.

Also, I am assuming you are using Regex correctly. Test this code with \p{Han}+ to see whether it still fails.

        using System;
        using System.Text.RegularExpressions;

        // Pattern from the question; .NET throws ArgumentException here.
        Regex regex = new Regex(@"\p{Han}+");
        Match match = regex.Match("YourString");
        if (match.Success)
        {
            Console.WriteLine("MATCH VALUE: " + match.Value);
        }