Why does a regex for CJK Unified Ideographs Extension B (Unicode 20000-2A6DF) match the letter "s"?

Question

Based on the example in "What's the complete range for Chinese characters in Unicode?", does the letter "s" belong to this range?

var r = /[\u20000-\u2A6DF]/;
var t = 'sad';
console.log(t.match(r))

This outputs ["s"].

Why?

In JS, you can only use `\u`+4-symbol sequence to match code points. — Wiktor Stribiżew, Aug 26 '15 at 12:05
CJK is a commonly used acronym for "Chinese, Japanese, and Korean" — Gigi Ionel, Aug 26 '15 at 12:05
So why does unicode.org say: CJK Unified Ideographs Extension B, Range: 20000–2A6D6? — Gigi Ionel, Aug 26 '15 at 12:08
It is correct. Does this [*Javascript unicode string, chinese character but no punctuation*](http://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation) post help you resolve your issue? — Wiktor Stribiżew, Aug 26 '15 at 12:13
These code points are outside of [*Basic Multilingual Plane*](https://mathiasbynens.be/notes/javascript-unicode). They cannot be handled in current JS implementation. — Wiktor Stribiżew, Aug 26 '15 at 12:17
It helped me. That was the right answer! I didn't know that there shouldn't be more than 4 characters after `\u`. TY! — Gigi Ionel, Aug 26 '15 at 12:26
Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/87993/discussion-between-gigi-ionel-and-stribizhev). — Gigi Ionel, Aug 26 '15 at 12:28
It has a common solution but not the same answer. My question is why "s" is matched by that range, and the answer is that JS does not support that range. How can you see this as a duplicate? Man, come on. — Gigi Ionel, Aug 27 '15 at 06:18

score 2 · Accepted Answer · edited May 23 '17 at 11:51

The regex you have contains astral code points:

Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
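
As a rough illustration of that rule, the sketch below compares a BMP character with the first code point of Extension B. The `isAstral` helper is a hypothetical name introduced only for this example:

```javascript
// U+20000 is the first code point of CJK Unified Ideographs Extension B.
var astral = '\uD840\uDC00'; // surrogate pair that encodes U+20000
var bmp = '\u4E2D';          // U+4E2D, a character inside the BMP

console.log(astral.length);  // 2 (two UTF-16 code units)
console.log(bmp.length);     // 1

// Hypothetical helper: true when the code point needs more than 4 hex digits.
function isAstral(codePoint) {
  return codePoint > 0xFFFF;
}

console.log(isAstral(0x20000)); // true
console.log(isAstral(0x4E2D));  // false
```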

These code points lie outside the Basic Multilingual Plane (BMP), the only plane whose code points a JavaScript regex can express with a single `\uXXXX` escape (exactly four hexadecimal digits). The regex engine in current ECMAScript implementations does not support astral code points; support is already specified in ECMAScript 6 (see Unicode code point escapes).
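
For comparison, here is a minimal sketch of how the intended range might be written with ECMAScript 6 code point escapes. It assumes an engine that implements the `u` flag (for example a recent Node.js or browser), which was not generally the case when this answer was written:

```javascript
// Assumes ES6 support: the `u` flag enables \u{...} code point escapes.
var r = /[\u{20000}-\u{2A6DF}]/u;

console.log(r.test('sad'));       // false: no Extension B character in "sad"
console.log(r.test('\u{20000}')); // true: U+20000 is inside the range
```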

Thus, the problem arises when the JavaScript regex engine interprets the pattern: inside your character class it "sees" `\u2000`, then `0`, then `-`, then `\u2A6D`, then `F`. The engine therefore builds a range from `0` (U+0030) to `\u2A6D` (⩭). That range covers more than ten thousand characters, including every English letter, so `s` (U+0073) matches.
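
The behaviour described above can be checked directly. This is a small sketch, not an exhaustive test; the variable name `broken` is chosen only for illustration:

```javascript
// Without the `u` flag, \u consumes exactly four hex digits, so this class is
// read as: \u2000, the range "0"-\u2A6D, and the literal "F".
var broken = /[\u20000-\u2A6DF]/;

console.log(broken.test('s'));            // true: U+0073 lies between U+0030 and U+2A6D
console.log(broken.test('\u2000'));       // true: listed explicitly in the class
console.log(broken.test(' '));            // false: U+0020 is below the range
console.log(broken.test('\uD840\uDC00')); // false: U+20000 itself is not matched
```

Note that the pattern does not even match the Extension B character it was written for, because both surrogate code units lie above `\u2A6D`.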

In the [*Javascript unicode string, chinese character but no punctuation*](http://stackoverflow.com/questions/21109011/javascript-unicode-string-chinese-character-but-no-punctuation) post, you can find a comprehensive Chinese character regex for JavaScript built from the Unicode code point combinations used in Chinese, but it contains a couple of typos.

Here is a working snippet:

var r = /(?:[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])+/g;
var t = '我的中文不好。我是意大利人。你知道吗？';
console.log(t.match(r));
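
The `\ud840`-style pairs in that pattern are UTF-16 surrogate pairs. As a sketch of where they come from, the standard UTF-16 encoding arithmetic can be written as follows (`toSurrogatePair` is a hypothetical helper name, not something taken from the linked post):

```javascript
// Hypothetical helper: split an astral code point into its UTF-16 surrogate pair.
function toSurrogatePair(codePoint) {
  var offset = codePoint - 0x10000;     // astral code points start at U+10000
  var high = 0xD800 + (offset >> 10);   // top 10 bits select the high surrogate
  var low = 0xDC00 + (offset & 0x3FF);  // bottom 10 bits select the low surrogate
  return [high.toString(16), low.toString(16)];
}

console.log(toSurrogatePair(0x20000).join(' ')); // d840 dc00
console.log(toSurrogatePair(0x2A6D6).join(' ')); // d869 ded6
```

These two results correspond to the start of `[\ud840-\ud868][\udc00-\udfff]` and the upper bound in `\ud869[\udc00-\uded6...]`, that is, the first and last assigned code points of Extension B as quoted from unicode.org in the comments.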

Looks like it will be possible in the next generation of browsers supporting ECMAScript 6. See [ECMAScript 6 Compatibility Table](https://kangax.github.io/compat-table/es6/). — Wiktor Stribiżew, Aug 27 '15 at 06:56
In Google Chrome, you can actually use ECMAScript 6 syntax. 1) Go to `chrome://flags/#enable-javascript-harmony` 2) click *Enable*. However, I did not manage to make it use the astral code points with RegExp :( — Wiktor Stribiżew, Aug 27 '15 at 08:35
