2

With the answers in

I have gotten close to what I need: getting all the Chinese punctuation in the string.

And Intl.Segmenter is much better than String.prototype.split(" ").

But there is one problem: /\p{P}/u.test(segment.segment) matches all punctuation, not just Chinese punctuation, so I also get English punctuation such as the apostrophe, comma, question mark and period.

I hope I do not need to resort to the answer in Chinese punctuation Unicode range?. It is too complicated. According to this wiki about Chinese punctuation, there are only about 20 such marks.

So is there an easy way to do that?

// Note: the Chinese part of the string uses the fullwidth , and ?
const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?"
const segmenterZH = new Intl.Segmenter('zh', { granularity: 'grapheme' })
const segments = segmenterZH.segment(str)
for (const segment of segments) {
  // \p{P} matches any Unicode punctuation, ASCII included
  if (/\p{P}/u.test(segment.segment)) {
    console.log(`${segment.index}:${segment.segment}`)
  }
}

--- update ---

I would like to add some new findings, partially inspired by Use regular expression to match characters appearing in Traditional Chinese ONLY :

  1. If I want to match all Chinese characters without punctuation, I can use /\p{sc=Han}/u, as https://javascript.info/regexp-unicode explains.

  2. I further tried what /\p{scx=Han}/u can match, as Script_Extensions explains, but I only got 2 more Chinese punctuation marks, 《 and 》, and missed the other Chinese punctuation.

@WiktorStribiżew's answer may explain that: these two punctuation marks, 《 》, fall in the CJK Symbols and Punctuation range, while the other Chinese punctuation falls in the Halfwidth and Fullwidth Forms range. But I still think that is a bug in /\p{scx=Han}/u.
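
To see the difference between the two properties, one can test a few characters directly. The sketch below is only illustrative: the `samples` list is an assumption picked for this comparison, not an exhaustive set of Chinese punctuation, and the fullwidth characters are written as escapes so they survive copy and paste.

```javascript
// Compare Script=Han (sc) with Script_Extensions=Han (scx) per character.
// The sample list is an assumption chosen for illustration only.
const samples = [
  '你',      // a Han ideograph
  '\u300A', // LEFT DOUBLE ANGLE BRACKET (CJK Symbols and Punctuation)
  '\u300B', // RIGHT DOUBLE ANGLE BRACKET (CJK Symbols and Punctuation)
  '\uFF0C', // FULLWIDTH COMMA (Halfwidth and Fullwidth Forms)
  '\uFF1F', // FULLWIDTH QUESTION MARK (Halfwidth and Fullwidth Forms)
  ',',      // ASCII comma
];

const scHan = /\p{sc=Han}/u;
const scxHan = /\p{scx=Han}/u;

const rows = samples.map((ch) => ({
  char: ch,
  codePoint: 'U+' + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, '0'),
  sc: scHan.test(ch),
  scx: scxHan.test(ch),
}));

console.table(rows);
```

The table makes it easy to check which code points a given engine treats as Han by script and which only by script extension.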

Qiulang
  • 1
    Just create a character class out of the Chinese punctuation chars. It seems most convenient since they are not that many. Also, it will be compatible with more ECMAScript standards. – Wiktor Stribiżew Apr 21 '23 at 08:21
  • @WiktorStribiżew this is what I thought now, but I was hoping maybe there is already some solution out there. – Qiulang Apr 21 '23 at 08:25
  • Ok, what about just excluding ASCII punctuation/symbols? ``/[^\P{P},.\/\\?;':"[\]{}!@#%&*()_-]/u``? This will match any Unicode punctuation symbols other than ASCII. – Wiktor Stribiżew Apr 21 '23 at 08:42
  • 1
    Another idea: if [CJK Symbols and Punctuation](http://www.unicode.org/charts/PDF/U3000.pdf) is what you are after, you can use `/\p{P}(?<=[\u3000-\u303F])/u` – Wiktor Stribiżew Apr 21 '23 at 08:54
  • @WiktorStribiżew `/\p{P}(?<=[\u3000-\u303F])/u` seems to work, let me further test it and get back to you! Thanks. – Qiulang Apr 21 '23 at 09:08
  • 1
    Also, if you want to include [half- and full-width](https://www.unicode.org/charts/PDF/UFF00.pdf) punctuation, use `/\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u` – Wiktor Stribiżew Apr 21 '23 at 09:13
  • @WiktorStribiżew I was about to say /\p{P}(?<=[\u3000-\u303F])/u won't work as we need to add full-width and you added your comment! YES /\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u work. How about you add your answer and I accept it ? – Qiulang Apr 21 '23 at 09:20
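
The first suggestion in the comments, an explicit character class, could be sketched roughly as follows. The `ZH_PUNCT` list is a hypothetical, non-exhaustive selection made for this example (it is not taken from the thread), so adjust it to the marks you actually care about.

```javascript
// Hypothetical, non-exhaustive list of Chinese punctuation, written as escapes:
// fullwidth comma, ideographic full stop, fullwidth ? ! : ;, ideographic comma,
// curly double and single quotes, fullwidth parentheses, double angle brackets,
// lenticular brackets, horizontal ellipsis, and U+2014.
const ZH_PUNCT =
  '\uFF0C\u3002\uFF1F\uFF01\uFF1A\uFF1B\u3001\u201C\u201D\u2018\u2019' +
  '\uFF08\uFF09\u300A\u300B\u3010\u3011\u2026\u2014';

// None of the listed characters is special inside a character class,
// so they can be dropped in without escaping.
const zhPunctRe = new RegExp(`[${ZH_PUNCT}]`, 'g');

// Same sentence as in the question; \uFF0C and \uFF1F are the fullwidth , and ?
const str =
  "你好\uFF0C让我们试试这个分词效果\uFF0C你说怎么样\uFF1FLet's try Intl.Segmenter, should we ?";

const hits = [...str.matchAll(zhPunctRe)].map((m) => `${m.index}:${m[0]}`);
console.log(hits.join('\n'));
```

This version needs neither the `u` flag nor lookbehind support, at the cost of maintaining the list by hand.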

1 Answer

2

If you want to match punctuation proper that belongs to either the CJK Symbols and Punctuation block or the Halfwidth and Fullwidth Forms block, you can use

/\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u

where

  • \p{P} - matches any punctuation proper char (i.e. it does not match math symbols like + or =, etc.)
  • (?<=[\u3000-\u303F\uFF00-\uFFEF]) - a positive lookbehind that requires the char matched by \p{P} to fall in either the \u3000-\u303F (CJK Symbols and Punctuation) or \uFF00-\uFFEF (Halfwidth and Fullwidth Forms) range. Since the lookbehind is checked right after \p{P} has consumed a char, the char it looks back at is that same char.

See a JavaScript demo below:

// Note: the Chinese part of the string uses the fullwidth , and ?
const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?"
const segmenterZH = new Intl.Segmenter('zh', { granularity: 'grapheme' })
const segments = segmenterZH.segment(str)
for (const segment of segments) {
  if (/\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/u.test(segment.segment)) {
    console.log(`${segment.index}:${segment.segment}`)
  }
}

Output:

2:,
14:,
20:?
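
To illustrate why the lookbehind ends up checking the very char that `\p{P}` has just consumed, here is a small sketch comparing it with an equivalent lookahead placed before `\p{P}`. The `sample` string and the `find` helper are made up for this comparison.

```javascript
// Lookbehind after \p{P}: inspects the char that was just consumed.
const viaLookbehind = /\p{P}(?<=[\u3000-\u303F\uFF00-\uFFEF])/gu;
// Lookahead before \p{P}: inspects the char that is about to be consumed.
const viaLookahead = /(?=[\u3000-\u303F\uFF00-\uFFEF])\p{P}/gu;

// Hypothetical sample: U+300A, a Han char, U+300B, a fullwidth comma,
// then ASCII text with ASCII punctuation.
const sample = '\u300A\u4E66\u300B\uFF0Ca, b?';

// Hypothetical helper: collect the start index of every match.
const find = (re) => [...sample.matchAll(re)].map((m) => m.index);

console.log(find(viaLookbehind)); // [ 0, 2, 3 ]
console.log(find(viaLookahead));  // [ 0, 2, 3 ]
```

Both patterns report the same positions; the ASCII comma and question mark are punctuation but fall outside the two ranges, so they are skipped.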

v flag support scenario

If your JavaScript environment supports the v flag, you can use a character class intersection:

// Note: the Chinese part of the string uses the fullwidth , and ?
const str = "你好,让我们试试这个分词效果,你说怎么样?Let's try Intl.Segmenter, should we ?";
for (const m of str.matchAll(/[\p{P}&&[\u3000-\u303F\uFF00-\uFFEF]]/gv)) {
  console.log(`${m.index}:${m[0]}`)
}
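
If you are not sure whether the target environment supports the `v` flag, one option is to probe for it at runtime and fall back to the lookbehind version. This is only a sketch: `makeZhPunctRegex` is a hypothetical helper name, and it relies on the `RegExp` constructor throwing a `SyntaxError` for an unknown flag.

```javascript
// Hypothetical helper: prefer the v-flag intersection, fall back to lookbehind.
function makeZhPunctRegex() {
  const ranges = '[\\u3000-\\u303F\\uFF00-\\uFFEF]';
  try {
    // Throws a SyntaxError in engines that do not know the v flag.
    return new RegExp(`[\\p{P}&&${ranges}]`, 'gv');
  } catch {
    return new RegExp(`\\p{P}(?<=${ranges})`, 'gu');
  }
}

// \uFF0C and \uFF1F are the fullwidth comma and question mark.
const text =
  "你好\uFF0C让我们试试这个分词效果\uFF0C你说怎么样\uFF1FLet's try Intl.Segmenter, should we ?";

const found = [...text.matchAll(makeZhPunctRegex())].map((m) => `${m.index}:${m[0]}`);
console.log(found.join('\n'));
```

Either branch should report the same three positions for this string, so callers do not need to know which one was taken.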
Wiktor Stribiżew
  • One thing I have not fully understood: `/\p{P}(?<=[\uFF00-\uFFEF])/u` seems to be enough, as I have tested all the Chinese punctuation I can think of. So why do we need `\u3000-\u303F` here? – Qiulang Apr 21 '23 at 09:30
  • @Qiulang I explained what that does in the answer, also cf. the [regex without the "CJK Symbols and Punctuation" support](https://regex101.com/r/XB8x7z/1) vs. [with this support](https://regex101.com/r/XB8x7z/2). – Wiktor Stribiżew Apr 21 '23 at 09:36
  • What I would like to argue is that the claim about `\u3000-\u303F` for CJK Symbols and Punctuation does not seem correct, at least not for Chinese punctuation, because Chinese punctuation always uses the fullwidth form, so it is in `\uFF00-\uFFEF`, not in `\u3000-\u303F`. – Qiulang Apr 21 '23 at 09:49
  • 1
    @Qiulang I have contacted my colleague who is a Chinese expert, and he states that the half/fullwidth part does not cover the `"【 】 〔 〕 《 》"` Chinese punctuation characters that the CJK Symbols and Punctuation covers. (He also includes `≦` and `≧` chars, but these are from the `\p{S}` category.) – Wiktor Stribiżew Apr 21 '23 at 10:19
  • Hi, I thought further about your answer and still have 2 questions. First, https://stackoverflow.com/questions/2973436/regex-lookahead-lookbehind-and-atomic-groups and https://javascript.info/regexp-lookahead-lookbehind seem to call your regex a **lookahead**, not a **lookbehind**. Second, whether it is a lookahead or a lookbehind, they check the **next** character, while your regex **further** checks the already matched character, so I am still amazed that it works! – Qiulang Apr 23 '23 at 04:39
  • @Qiulang My regex is not called anyhow, it just contains a Unicode category/property class (`\p{P}`) and a positive lookbehind (`(?<=...)`). You are amazed because you think my lookbehind is a lookahead. – Wiktor Stribiżew Apr 23 '23 at 10:50
  • Oh, so your lookbehind actually looks for anything(or nothing) as long as it was preceded by Chinese punctuation. Now I fully understand. How clever! – Qiulang Apr 23 '23 at 13:43
  • 1
    https://v8.dev/features/regexp-v-flag#intersection is what I was looking for. It is easier to understand than lookbehind. Thanks! – Qiulang May 19 '23 at 08:58