Is there a regular expression that matches a single Chinese character, that is, any Chinese character that exists?
Recommendation
To match patterns with Chinese characters and other Unicode code points with a Flex-compatible lexical analyzer, you could use the RE/flex lexical analyzer for C++ that is backwards compatible with Flex. RE/flex supports Unicode and works with Bison to build lexers and parsers.
You can write Unicode patterns (and UTF-8 regular expressions) in RE/flex specifications such as:
%option flex unicode
%%
[肖晗] { printf ("xiaohan/2\n"); }
%%
Use the global %option unicode to enable Unicode. You can also use a local modifier (?u:) to restrict Unicode to a single pattern (so everything else is still ASCII/8-bit as in Flex):
%option flex
%%
(?u:[肖晗]) { printf ("xiaohan/2\n"); }
(?u:\p{Han}) { printf ("Han character %s\n", yytext); }
. { printf ("8-bit character %d\n", (unsigned char)yytext[0]); }
%%
Option flex enables Flex compatibility, so you can use yytext, yyleng, ECHO, and so on. Without the flex option RE/flex expects Lexer method calls: text() (or str() and wstr() for std::string and std::wstring), size() (or wsize() for wide char length), and echo(). RE/flex method calls are cleaner IMHO, and include wide char operations.
Background
In plain old Flex I ended up defining ugly UTF-8 patterns to capture ASCII letters and UTF-8 encoded letters for a compiler project that required support for Unicode identifiers id:
digit [0-9]
alpha ([a-zA-Z_\xA8\xAA\xAD\xAF\xB2\xB5\xB7\xB8\xB9\xBA\xBC\xBD\xBE]|[\xC0-\xFF][\x80-\xBF]*|\\u([0-9a-fA-F]{4}))
id ({alpha})({alpha}|{digit})*
The alpha pattern supports ASCII letters, underscore, and Unicode code points that are used in identifiers (\p{L} etc). The pattern permits more Unicode code points than absolutely necessary to keep its size manageable, so it trades accuracy for compactness and also accepts some overlong sequences that are not valid UTF-8. If you are thinking about this approach, then be wary of the problems and safety concerns. Use a Unicode-capable scanner generator instead, such as RE/flex.
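To make the overlong problem concrete, here is a minimal C++ sketch (not part of the original lexer) that mimics the loose [\xC0-\xFF][\x80-\xBF]* branch of alpha and compares it with a strict check for one well-formed UTF-8 sequence. The function names loose_match and strict_match are hypothetical, chosen only for this illustration.

```cpp
#include <cstddef>
#include <string>

// Mimics the loose pattern [\xC0-\xFF][\x80-\xBF]*: one lead byte followed
// by any number of continuation bytes. Returns the number of bytes matched
// at the start of s, or 0 if there is no match.
std::size_t loose_match(const std::string& s) {
    if (s.empty()) return 0;
    unsigned char lead = static_cast<unsigned char>(s[0]);
    if (lead < 0xC0) return 0;
    std::size_t n = 1;
    while (n < s.size()) {
        unsigned char c = static_cast<unsigned char>(s[n]);
        if (c < 0x80 || c > 0xBF) break;
        ++n;
    }
    return n;
}

// Strict check: returns the length of one well-formed UTF-8 sequence at the
// start of s, or 0 if the bytes are truncated, overlong, a surrogate, or
// beyond U+10FFFF.
std::size_t strict_match(const std::string& s) {
    if (s.empty()) return 0;
    unsigned char b0 = static_cast<unsigned char>(s[0]);
    if (b0 < 0x80) return 1;  // ASCII
    std::size_t len;
    char32_t cp;      // decoded code point
    char32_t min_cp;  // smallest code point that needs this many bytes
    if (b0 >= 0xC2 && b0 <= 0xDF)      { len = 2; cp = b0 & 0x1F; min_cp = 0x80; }
    else if (b0 >= 0xE0 && b0 <= 0xEF) { len = 3; cp = b0 & 0x0F; min_cp = 0x800; }
    else if (b0 >= 0xF0 && b0 <= 0xF4) { len = 4; cp = b0 & 0x07; min_cp = 0x10000; }
    else return 0;  // 0x80-0xC1 and 0xF5-0xFF are never valid lead bytes
    if (s.size() < len) return 0;  // truncated sequence
    for (std::size_t i = 1; i < len; ++i) {
        unsigned char c = static_cast<unsigned char>(s[i]);
        if ((c & 0xC0) != 0x80) return 0;  // not a continuation byte
        cp = (cp << 6) | (c & 0x3F);
    }
    if (cp < min_cp) return 0;                    // overlong encoding
    if (cp >= 0xD800 && cp <= 0xDFFF) return 0;   // UTF-16 surrogate
    if (cp > 0x10FFFF) return 0;                  // outside Unicode
    return len;
}
```

With these two helpers, the bytes C0 80 (an overlong encoding of U+0000) and ED A0 80 (the surrogate U+D800) are both accepted by loose_match but rejected by strict_match, which is exactly the inaccuracy the alpha pattern tolerates.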
Safety
When using UTF-8 directly in Flex patterns, there are several concerns:
Encoding your own UTF-8 patterns in Flex for matching any Unicode character is error-prone. Patterns should be restricted to characters in the valid Unicode range only. Valid Unicode code points cover the ranges U+0000 to U+D7FF and U+E000 to U+10FFFF. The range U+D800 to U+DFFF is reserved for UTF-16 surrogate pairs, and its code points are invalid. When using a tool to convert a Unicode range to UTF-8, make sure to exclude invalid code points.
Patterns should reject overlong and other invalid byte sequences. Invalid UTF-8 should not be silently accepted.
Catching lexical input errors in your lexer requires a special . (dot) that matches valid and invalid Unicode, including UTF-8 overruns and invalid byte sequences, in order to produce an error message that the input is rejected. If you use dot as a "catch-all-else" to produce an error message, but your dot does not match invalid Unicode, then your lexer will hang ("scanner is jammed") or will ECHO rubbish characters to the output via the Flex "default rule".
Your scanner should recognize a UTF BOM (Unicode Byte Order Mark) in the input to switch to UTF-8, UTF-16 (LE or BE), or UTF-32 (LE or BE).
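The BOM check in the last point can be done on the first few bytes of the input before scanning starts. The following is a rough C++ sketch under my own assumptions: the names Encoding, BomResult, and detect_bom are hypothetical and are not part of Flex or RE/flex.

```cpp
#include <cstddef>
#include <string>

enum class Encoding { Utf8, Utf16LE, Utf16BE, Utf32LE, Utf32BE, NoBom };

struct BomResult {
    Encoding encoding;       // encoding announced by the BOM, if any
    std::size_t bom_length;  // number of leading bytes to skip
};

// Inspects the first bytes of the input for a Unicode byte order mark.
BomResult detect_bom(const std::string& data) {
    auto starts_with = [&](const char* bom, std::size_t n) {
        return data.size() >= n && data.compare(0, n, bom, n) == 0;
    };
    // The UTF-32 marks are tested first because FF FE 00 00 (UTF-32 LE)
    // begins with FF FE (UTF-16 LE). This treats the rare input "UTF-16 LE
    // BOM followed by U+0000" as UTF-32 LE, which is an assumption.
    if (starts_with("\x00\x00\xFE\xFF", 4)) return {Encoding::Utf32BE, 4};
    if (starts_with("\xFF\xFE\x00\x00", 4)) return {Encoding::Utf32LE, 4};
    if (starts_with("\xEF\xBB\xBF", 3))     return {Encoding::Utf8, 3};
    if (starts_with("\xFE\xFF", 2))         return {Encoding::Utf16BE, 2};
    if (starts_with("\xFF\xFE", 2))         return {Encoding::Utf16LE, 2};
    return {Encoding::NoBom, 0};
}
```

A caller would skip bom_length bytes and, for anything other than UTF-8, convert the remaining input to UTF-8 before handing it to byte-oriented patterns. What to assume when no BOM is present is left to the application.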
As you point out, patterns such as [unicode characters] do not work at all with Flex because UTF-8 characters in a bracket list are multibyte characters, so each single byte can be matched but not the UTF-8 character as a whole.
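The effect can be illustrated outside Flex with a small C++ sketch that treats a bracket list the way a byte-oriented scanner does, as a set of individual bytes. The helpers byte_class and match_class are hypothetical names for this illustration only and do not reproduce Flex internals.

```cpp
#include <bitset>
#include <cstddef>
#include <string>

// Builds the set of bytes that a byte-oriented scanner sees for the
// contents of a bracket list written in UTF-8.
std::bitset<256> byte_class(const std::string& bracket_contents) {
    std::bitset<256> set;
    for (unsigned char c : bracket_contents) set.set(c);
    return set;
}

// A one-byte character class matches at most one byte of input:
// returns 1 if the first input byte is in the set, otherwise 0.
std::size_t match_class(const std::bitset<256>& set, const std::string& input) {
    if (input.empty()) return 0;
    return set.test(static_cast<unsigned char>(input[0])) ? 1 : 0;
}
```

For the bracket list [肖晗], the contents are the bytes E8 82 96 (肖, U+8096) and E6 99 97 (晗, U+6657), so the class holds six separate bytes. Matching it against 肖 consumes only the first byte E8 rather than all three, and it equally matches the first byte of an unrelated character such as the sequence E8 99 97.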
See also invalid UTF encodings in the RE/flex user guide.