I suggest taking a look at Zhon, a Python library that provides constants commonly used in Chinese text processing.
Luckily, its hanzi.py module defines a regex that should largely suit your needs:
#: A regular expression pattern for a Chinese sentence. A sentence is defined
#: as a series of characters and non-stop punctuation marks followed by a stop
#: and zero or more container-closing punctuation marks (e.g. apostrophe or brackets).
sent = sentence = '[{characters}{radicals}{non_stops}]*{sentence_end}'.format(
    characters=characters, radicals=radicals, non_stops=non_stops,
    sentence_end=_sentence_end)
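The definition above results in the following regex (code points that do not display reliably, such as the supplementary-plane ranges and the ideographic space, are written as \x{...} escapes):
[〇一-鿿㐀-䶿豈-\x{FAFF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\x{3000}、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*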
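As a rough cross-check, the same pattern can be assembled by hand with Python's built-in re module, without installing Zhon. This is only a sketch: the code ranges are the ones from hanzi.py, but the names CHARACTERS, RADICALS, NON_STOPS, STOPS, CLOSERS and split_sentences are my own and are not part of Zhon's API.

```python
import re

# Code ranges taken from Zhon's hanzi.py; the constant names here are hypothetical.
CHARACTERS = (
    "\u3007"                 # ideographic number zero
    "\u4E00-\u9FFF"          # CJK Unified Ideographs
    "\u3400-\u4DBF"          # CJK Unified Ideographs Extension A
    "\uF900-\uFAFF"          # CJK Compatibility Ideographs
    "\U00020000-\U0002A6DF"  # Extension B
    "\U0002A700-\U0002B73F"  # Extension C
    "\U0002B740-\U0002B81F"  # Extension D
    "\U0002F800-\U0002FA1F"  # CJK Compatibility Ideographs Supplement
)
RADICALS = "\u2F00-\u2FD5\u2E80-\u2EF3"  # Kangxi Radicals, CJK Radicals Supplement
NON_STOPS = (
    "\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D"
    "\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F"
    "\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64"
    "\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010"
    "\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E"
    "\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E"
    "\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7"
)
STOPS = "\uFF01\uFF1F\uFF61\u3002"  # fullwidth ! and ?, halfwidth and ideographic full stop
CLOSERS = (
    "\u300D\uFE42\u201D\u300F\u2019\u300B\uFF09\uFF3D\uFF5D"
    "\u3015\u3017\u3019\u301B\u3009\u3011"
)

# Same shape as the Zhon definition: body characters, one stop, optional closers.
SENTENCE = "[{c}{r}{n}]*[{s}][{e}]*".format(
    c=CHARACTERS, r=RADICALS, n=NON_STOPS, s=STOPS, e=CLOSERS)


def split_sentences(text):
    """Return every substring of text that matches the sentence pattern."""
    return re.findall(SENTENCE, text)


print(split_sentences("我的中文不好。我是意大利人。你知道吗\uFF1F"))
```

Note that the stops are the fullwidth forms only, so a sentence ending in an ASCII "?" or "!" will not be matched unless you add those characters to STOPS yourself.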
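PHP example (the u modifier is required so that PCRE treats the pattern and subject as UTF-8):
$pattern = '/[〇一-鿿㐀-䶿豈-\x{FAFF}\x{20000}-\x{2A6DF}\x{2A700}-\x{2B73F}\x{2B740}-\x{2B81F}\x{2F800}-\x{2FA1F}⼀-⿕⺀-⻳＂＃＄％＆＇（）＊＋，－／：；＜＝＞＠［＼］＾＿｀｛｜｝～｟｠｢｣､\x{3000}、〃〈〉《》「」『』【】〔〕〖〗〘〙〚〛〜〝〞〟〰〾〿–—‘’‛“”„‟…‧﹏﹑﹔·]*[！？｡。][」﹂”』’》）］｝〕〗〙〛〉】]*/u';
preg_match_all($pattern, "我的中文不好。我是意大利人。你知道吗？", $matches, PREG_SET_ORDER, 0);
var_dump($matches);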
If you prefer explicit character code ranges for the relevant CJK ideograph Unicode blocks, refer to the Python source I have linked, or take them from the JavaScript sample below:
// The `u` flag is needed for the \u{...} escapes that cover code points above U+FFFF.
const regex = /[\u3007\u4E00-\u9FFF\u3400-\u4DBF\uF900-\uFAFF\u{20000}-\u{2A6DF}\u{2A700}-\u{2B73F}\u{2B740}-\u{2B81F}\u{2F800}-\u{2FA1F}\u2F00-\u2FD5\u2E80-\u2EF3\uFF02\uFF03\uFF04\uFF05\uFF06\uFF07\uFF08\uFF09\uFF0A\uFF0B\uFF0C\uFF0D\uFF0F\uFF1A\uFF1B\uFF1C\uFF1D\uFF1E\uFF20\uFF3B\uFF3C\uFF3D\uFF3E\uFF3F\uFF40\uFF5B\uFF5C\uFF5D\uFF5E\uFF5F\uFF60\uFF62\uFF63\uFF64\u3000\u3001\u3003\u3008\u3009\u300A\u300B\u300C\u300D\u300E\u300F\u3010\u3011\u3014\u3015\u3016\u3017\u3018\u3019\u301A\u301B\u301C\u301D\u301E\u301F\u3030\u303E\u303F\u2013\u2014\u2018\u2019\u201B\u201C\u201D\u201E\u201F\u2026\u2027\uFE4F\uFE51\uFE54\u00B7]*[\uFF01\uFF1F\uFF61\u3002][\u300D\uFE42\u201D\u300F\u2019\u300B\uFF09\uFF3D\uFF5D\u3015\u3017\u3019\u301B\u3009\u3011]*/gu;
const str = `我的中文不好。我是意大利人。你知道吗\uFF1F`;
let m;
while ((m = regex.exec(str)) !== null) {
    // This is necessary to avoid infinite loops with zero-width matches
    if (m.index === regex.lastIndex) {
        regex.lastIndex++;
    }
    // The result can be accessed through the `m`-variable.
    m.forEach((match, groupIndex) => {
        console.log(`Found match, group ${groupIndex}: ${match}`);
    });
}
PS: I also found this answer helpful.