How to detect Chinese character with punctuation in regex?

Question

I have noticed that there are question asked on how to detect Chinese character with regex. These are some question I read on stackoverflow:

Php check if the string has Chinese chars

detecting chinese characters in php string

And some article outside stackoverflow:

http://www.regular-expressions.info/unicode.html - unicode script

Basically they suggest using either \p{Han}+ or [\x{4e00}-\x{9fa5}]+.* to detect Chinese character. Is there a way to detect Chinese punctuation mark as well?

Some example (but not all) of Chinese punctuation mark: ：？。，《》「」『』－（）【】

I thought only regex would be enough..... but if it is necessary to do that in code (an make sure it is not hard coding every punctuation mark for checking), php solution would be fine too. — cytsunny, Jun 15 '18 at 06:02
It might be not so precise, but probably in such cases, you may match punctuation but ASCII punctuation, [`'~(?![[:ascii:]])\p{P}~u'`](https://regex101.com/r/Mo0l7v/1) or `~(?![[:ascii:]])[[:punct:]]~u`. — Wiktor Stribiżew, Jun 18 '18 at 23:23
@cytsunny the answer would depend on the regex flavor used, so please specify what flavor(s) you have in mind. — ivan_pozdeev, Jun 20 '18 at 02:31
@cytsunny - Can I link to the answer you selected as the definition of Chinese punctuation? What are they btw ? — , Jun 21 '18 at 11:32

score 5 · Answer 1 · edited Dec 24 '19 at 23:30

5

[\u4e00-\u9fa5] works for me in VSCode, the other suggested solutions don't. I stumbled on the solution here, which is a online regex interpreter: http://tool.chinaz.com/regex/

edited Dec 24 '19 at 23:30

Dharman

30,962
25
85
135

answered Dec 24 '19 at 14:35

Leon

356
6
10

wp78de · Accepted Answer · 2018-06-21T03:20:20.317

I suggest informing yourself by taking a look at Zhon, a Python library that provides constants commonly used in Chinese text processing.

Luckily, hanzi.py contains a definition of a regex that should pretty much suit your needs:

#: A regular expression pattern for a Chinese sentence. A sentence is defined
#: as a series of characters and non-stop punctuation marks followed by a stop
#: and zero or more container-closing punctuation marks (e.g. apostrophe or brackets).

sent = sentence = '[{characters}{radicals}{non_stops}]*{sentence_end}'.format(
    characters=characters, radicals=radicals, non_stops=non_stops,
    sentence_end=_sentence_end)

The definition above results in the following regex^*:

[〇一-鿿㐀-䶿豈-﫿----⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*

Code Example:

preg_match_all('/[〇一-鿿㐀-䶿豈-﫿----⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､　、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*/', "我的中文不好。我是意大利人。你知道吗？", $matches, PREG_SET_ORDER, 0);
var_dump($matches);

If you prefer using Character code ranges for pertinent CJK ideograph Unicode blocks reference the Python source I have linked or get it from the Javascript sample below:

const regex = /[\u3007u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u20000-\u2A6DF\u2A700-\u2B73F\u2B740-\u2B81F\u0002F800-\u2FA1F\u2F00-\u2FD5\u2E80-\u2EF3\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7]*[\uFF01\uFF1F\uFF61\u3002][」﹂”』’》）］｝〕〗〙〛〉】]*/gm;
const str = `我的中文不好。我是意大利人。你知道吗？`;
let m;

while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}

PS: I also found this answer helpful.

score 2 · Answer 3 · 2018-06-20T01:57:11.327

Is there a way to detect Chinese punctuation mark as well?

This is a basic Unicode property regex to get just the punctuation.
It says get all \p{Han} script characters, only if they are in the CJK block
for symbols and punctuation.

This effectively filters to Han symbols and punctuation.

As of Unicode 10, this turns out to be these 15 characters: 々〇〡〢〣〤〥〦〧〨〩〸〹〺〻

\p{Han}(?<=\p{Block=CJK_Symbols_And_Punctuation})

If the engine you're using doesn't support properties or doesn't support the
block symbols, you can use a class range of the resulting 15 chars.
which is one of these

[\x{3005}\x{3007}\x{3021}-\x{3029}\x{3038}-\x{303B}]
[\u3005\u3007\u3021-\u3029\u3038-\u303B]

Source: http://www.regexformat.com UCD Interface.

How to detect Chinese character with punctuation in regex?

3 Answers3