Use regular expression to match ANY Chinese character in utf-8 encoding

Question

For example, to match a string consisting of m to n Chinese characters, I could use:

[single Chinese character regular expression]{m,n}

Is there a regular expression for a single Chinese character that matches any Chinese character that exists?

At the very least, please provide information on the regex engine you're using. — Lily Ballard, Mar 06 '12 at 00:55
@KevinBallard I am not quite sure which engine I am using. What I know is I use the regular expression functionality in `flex`(the lexer) — xiaohan2012, Mar 06 '12 at 01:00
Possible duplicate of [How to make a flex (lexical scanner) to read UTF-8 characters input?](https://stackoverflow.com/questions/921648/how-to-make-a-flex-lexical-scanner-to-read-utf-8-characters-input) — Thomas Dickey, Jul 08 '17 at 13:05
flex won't do this; answers which assume it does do not address the question. — Thomas Dickey, Jul 08 '17 at 13:06

score 47 · Answer 1 · answered Mar 06 '12 at 00:56

The regex to match a Chinese (well, CJK) character is

\p{script=Han}

which can be abbreviated to simply

\p{Han}

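This assumes that your regex compiler meets requirement RL1.2 Properties from UTS#18 Unicode Regular Expressions. Perl and Java 7 both meet that spec, but many others do not. In Java, write the script property as `\p{script=Han}` or `\p{IsHan}`; the bare `\p{Han}` form is rejected there.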

answered Mar 06 '12 at 00:56

tchrist

It is a pity that flex does not seem to support it. Thanks anyway. – xiaohan2012 Mar 06 '12 at 01:03
@xiaohan2012: I don't think flex really supports unicode at all – Lily Ballard Mar 06 '12 at 01:04
@xiaohan2012 If I were you, I would look into using some of the lexing and parsing tools available under Perl or Java, which do support Unicode. – tchrist Mar 06 '12 at 04:06
Is there a variant of this for C# – tofutim Jun 11 '13 at 20:22
This worked perfectly in Coda 2 find and replace. Thanks! – Jake Feb 26 '14 at 20:41
It gives `Unknown character property name {Han}` for Java-8. What's wrong? – Andremoniy Mar 21 '17 at 09:52
Perhaps use [RE/flex](https://github.com/Genivia/RE-flex) as an alternative to Flex? It handles Unicode quite well and uses the same Flex syntax. I've also been unhappy with the lack of progress with Flex, so I created a new version (RE/flex) that understands modern Unicode and character encodings. – Dr. Alex RE Jul 06 '17 at 15:57
I find this rule is not supported by Impala. – Jiaxiang Nov 22 '18 at 07:29
@Andremoniy I believe you can use: `\p{IsHan}` in Java: see https://docs.oracle.com/javase/tutorial/essential/regex/unicode.html – ptha May 19 '20 at 14:57

score 7 · Answer 2 · edited Jan 08 '16 at 05:11

In Java,

\p{InCJK_UNIFIED_IDEOGRAPHS}{1,3}

edited Jan 08 '16 at 05:11

Tushar

answered Jun 04 '14 at 03:20

DayDayHappy

Note that this only finds characters in the block from U+4E00–U+9FFF. It does not find all Chinese characters that exist. – martin Jun 29 '16 at 17:12
The question is tagged with the Flex lexer for C and C++ that does not support the `\p{C}` character block. – Dr. Alex RE Jul 08 '17 at 12:59

Artemious · Answer 3 · 2020-02-11T19:19:59.000

6

In C#

new Regex(@"\p{IsCJKUnifiedIdeographs}")

Here it is in the Microsoft docs

And here's more info from Wikipedia: CJK Unified Ideographs

The basic block named CJK Unified Ideographs (4E00–9FFF) contains 20,976 basic Chinese characters in the range U+4E00 through U+9FEF. The block not only includes characters used in the Chinese writing system but also kanji used in the Japanese writing system and hanja, whose use is diminishing in Korea. Many characters in this block are used in all three writing systems, while others are in only one or two of the three. Chinese characters are also used in Vietnam's Nôm script (now obsolete).

edited Feb 11 '20 at 19:19

answered Feb 11 '20 at 11:25

Artemious

Thanks for your answer! To help improve your post, please consider adding a link to documentation, or add an explanation to help explain what this does. – Kevin Feb 11 '20 at 13:25

Dr. Alex RE · Answer 4 · 2017-07-08T12:56:59.743

Is there a regular expression for a single Chinese character that matches any Chinese character that exists?

Recommendation

To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.

You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:

%option flex unicode
%%
[肖晗]   { printf ("xiaohan/2\n"); }
%%

Use global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):

%option flex
%%
(?u:[肖晗])   { printf ("xiaohan/2\n"); }
(?u:\p{Han})  { printf ("Han character %s\n", yytext); }
.             { printf ("8-bit character %d\n", yytext[0]); }
%%

Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.

Background

In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers id:

digit           [0-9]
alpha           ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id              ({alpha})({alpha}|{digit})*

The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc). To keep the pattern at a manageable size, it permits more Unicode code points than strictly necessary: it trades accuracy for compactness, and in some cases it accepts overlong sequences that are not valid UTF-8. If you are considering this approach, then be wary of these problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
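To make the byte-level idea concrete, here is a minimal sketch of how a single block, U+4E00 to U+9FFF, maps to UTF-8 byte ranges. It is written with Python's `re` on `bytes` only because that is easy to run and check; the names `HAN_BLOCK_UTF8` and `han_run` are made up for this illustration, and carrying the same alternation of byte classes over to a Flex definition is an assumption that is not tested here.

```python
import re

# UTF-8 encodings of the block boundaries:
#   U+4E00 -> E4 B8 80
#   U+9FFF -> E9 BF BF
# So a character of the block is either E4 followed by B8..BF and one
# continuation byte, or E5..E9 followed by two continuation bytes.
HAN_BLOCK_UTF8 = rb"(?:\xE4[\xB8-\xBF][\x80-\xBF]|[\xE5-\xE9][\x80-\xBF][\x80-\xBF])"


def han_run(m: int, n: int) -> re.Pattern:
    """Compile a bytes pattern for m to n consecutive characters of the block."""
    return re.compile(HAN_BLOCK_UTF8 + b"{%d,%d}" % (m, n))


if __name__ == "__main__":
    pattern = han_run(2, 4)
    print(pattern.fullmatch("世界".encode("utf-8")) is not None)    # True
    print(pattern.fullmatch("世".encode("utf-8")) is not None)      # False, too short
    print(pattern.fullmatch("せかい".encode("utf-8")) is not None)  # False, hiragana
```

This covers only the basic block, so it shares the limitation of any fixed-range approach: ideographs in the extension blocks are not matched.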

Safety

When using UTF-8 directly in Flex patterns, there are several concerns:

- Encoding your own UTF-8 patterns in Flex to match any Unicode character is error-prone. Patterns should be restricted to characters in the valid Unicode range only. Valid Unicode code points cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs, and these are invalid code points. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.
- Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
- Catching lexical input errors in your lexer requires a special . (dot) that matches both valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, so that the lexer can report that the input is rejected. If you use dot as a "catch-all-else" rule to produce an error message, but your dot does not match invalid Unicode, then your lexer will hang ("scanner is jammed") or will ECHO rubbish characters to the output via the Flex "default rule".
- Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
- As you point out, patterns such as [unicode characters] do not work at all with Flex, because UTF-8 characters in a bracket list are multibyte: each individual byte can be matched, but not the UTF-8 character as a whole.
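The concerns above about overlong sequences, surrogates, and reporting bad input can be sketched as a byte-level check. This is an illustration in Python rather than a Flex rule set, under the assumption that the byte ranges follow the Unicode Standard's table of well-formed UTF-8 byte sequences; `WELL_FORMED` and `first_invalid_offset` are hypothetical names.

```python
import re

# One well-formed UTF-8 sequence. Lead bytes C0, C1 and F5..FF never match,
# and the restricted second-byte ranges exclude overlongs and surrogates.
WELL_FORMED = re.compile(
    rb"""
      [\x00-\x7F]                        # ASCII
    | [\xC2-\xDF][\x80-\xBF]             # U+0080..U+07FF
    | \xE0[\xA0-\xBF][\x80-\xBF]         # U+0800..U+0FFF, no overlongs
    | [\xE1-\xEC\xEE\xEF][\x80-\xBF]{2}  # U+1000..U+CFFF and U+E000..U+FFFF
    | \xED[\x80-\x9F][\x80-\xBF]         # U+D000..U+D7FF, no surrogates
    | \xF0[\x90-\xBF][\x80-\xBF]{2}      # U+10000..U+3FFFF, no overlongs
    | [\xF1-\xF3][\x80-\xBF]{3}          # U+40000..U+FFFFF
    | \xF4[\x80-\x8F][\x80-\xBF]{2}      # U+100000..U+10FFFF
    """,
    re.VERBOSE,
)


def first_invalid_offset(data: bytes) -> int:
    """Return the offset of the first ill-formed byte, or -1 if data is valid."""
    pos = 0
    while pos < len(data):
        m = WELL_FORMED.match(data, pos)
        if m is None:
            return pos
        pos = m.end()
    return -1


if __name__ == "__main__":
    print(first_invalid_offset("肖晗 ok".encode("utf-8")))  # -1
    print(first_invalid_offset(b"ok \xC0\x80"))             # 3
```

Returning an offset rather than a boolean mirrors the role of the catch-all rule: the scanner can say where the input was rejected instead of jamming or echoing the bytes.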

See also invalid UTF encodings in the RE/flex user guide.

score 1 · Answer 5 · answered Apr 08 '23 at 15:36

For most programming languages, a regular expression that matches 99.9%+ of Chinese characters is the character class:

`[\u4E00-\u9FFF]`

Works with Python, modern JavaScript, Go, and Rust, but not PHP.

Useful if your language doesn't support notations like `\p{Han}`, `\p{script=Han}`, or `\p{IsCJKUnifiedIdeographs}` from the other answers.

NB: This range corresponds to the CJK Unified Ideographs block, so it also matches Han characters as used in Japanese, Korean, and Vietnamese text.
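As a minimal sketch of the range approach in Python (the names `BASIC`, `WIDER`, and `chinese_run` are made up for illustration): the first pattern is the single block from this answer, and the second adds three more blocks as an approximation. Neither is equivalent to `\p{Han}`, and which blocks are "enough" is an assumption that depends on your input.

```python
import re

# Basic block only: CJK Unified Ideographs.
BASIC = re.compile(r"[\u4E00-\u9FFF]")

# Wider approximation: Extension A, the basic block, CJK Compatibility
# Ideographs, and Extension B. Still narrower than the Han script property.
WIDER = re.compile(
    r"[\u3400-\u4DBF\u4E00-\u9FFF\uF900-\uFAFF\U00020000-\U0002A6DF]"
)


def chinese_run(m: int, n: int) -> re.Pattern:
    """Compile a pattern for m to n consecutive basic-block characters."""
    return re.compile(r"[\u4E00-\u9FFF]{%d,%d}" % (m, n))


if __name__ == "__main__":
    print(BASIC.findall("hello 世界"))                        # ['世', '界']
    print(chinese_run(2, 3).fullmatch("世界") is not None)    # True
    print(BASIC.search("\U00020000") is not None)             # False
    print(WIDER.search("\U00020000") is not None)             # True
```

The last two lines show the gap the comments on other answers point out: U+20000 is a Chinese character in Extension B, and the basic range does not match it.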

answered Apr 08 '23 at 15:36

Eli O.

For input that accepts (or requires) Unicode encoded text, `[\u4E00-\u9FFF]` is equivalent to `[一-鿿]` – remcycles Apr 09 '23 at 13:53

dripp · Answer 6 · 2015-12-21T10:55:48.570

0

In Java 7 and up, the format should be: "\p{IsHan}"

edited Dec 21 '15 at 10:55

answered Apr 20 '15 at 10:03

dripp

actually, the edit history shows that you also wrote `InHan`, @Robert only added formatting so the expression appears monospaced – Zoltán Nov 12 '15 at 10:22
Hint: You can *choose* to edit it yourself to correct your error. ;-) – Robert Nov 12 '15 at 10:57
The question doesn't ask how to do it in Java, though. The question is tagged "flex-lexer". – Jul 10 '16 at 12:11

BiaowuDuan · Answer 7 · 2022-11-05T03:33:09.987

-1

In Go, like this:

package main

import (
    "fmt"
    "regexp"
)

func main() {
    compile, err := regexp.Compile("\\p{Han}") // matches any single Han character
    if err != nil {
        return
    }
    str := compile.FindString("hello 世界")
    fmt.Println(str) // output: 世
}

edited Nov 05 '22 at 03:33

answered Nov 04 '22 at 10:21

BiaowuDuan

Please read [answer] and [edit] your answer to contain an explanation as to why this code would actually solve the problem at hand. Always remember that you're not only solving the problem, but are also educating the OP and any future readers of this post. – Adriaan Nov 04 '22 at 10:23
