Based on the example in "What's the complete range for Chinese characters in Unicode?",
does the letter "s" belong to this alphabet?
var r = /[\u20000-\u2A6DF]/;
var t = 'sad';
console.log(t.match(r))
outputs ["s"]
Why?
The regex you have contains astral code points:
Astral code points are pretty easy to recognize: if you need more than 4 hexadecimal digits to represent the code point, it’s an astral code point.
These code points lie outside the Basic Multilingual Plane (BMP). Only BMP code points can be written in a JavaScript regex with a four-digit escape of the form \uXXXX.
However, the JavaScript regex engine does not support astral code points in the current ECMAScript implementation. Support is already present in ECMAScript 6 (with the u flag); see Unicode code point escapes.
Thus, the problem arises when the JavaScript regex engine tries to interpret the regex pattern: inside your character class it "sees" \u2000, then 0, then -, then \u2A6D, then F. The engine then creates a range from 0 to \u2A6D (⩭). That range covers a very large number of characters, including all English letters and a lot more, so the letter "s" matches.
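The parsing described above can be checked directly. The following is a minimal sketch, not part of the original question: `spelledOut` is a hypothetical name for the same character class written out the way the engine reads it, and the sample characters are chosen only for illustration.

```javascript
// Without the `u` flag, \u consumes exactly four hex digits,
// so the class below is NOT the range U+20000..U+2A6DF.
var broken = /[\u20000-\u2A6DF]/;

// The same class spelled out as the engine reads it:
// \u2000, then the range "0" (U+0030) to \u2A6D, then "F" (U+0046).
var spelledOut = /[\u2000\u0030-\u2A6D\u0046]/;

// Sample inputs: ASCII characters, a BMP Chinese character,
// and U+20000 written as its surrogate pair.
var samples = ['s', '0', 'F', '\u2000', '/', '中', '\uD840\uDC00'];

samples.forEach(function (ch) {
  // Both regexes should give the same answer for every sample.
  console.log(JSON.stringify(ch), broken.test(ch), spelledOut.test(ch));
});
```

Both regexes agree on every sample. Note that "中" (U+4E2D) and the surrogate pair for U+20000 are not matched at all, because they lie above \u2A6D, so the pattern matches Latin letters while missing the characters it was meant for.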
In the post "Javascript unicode string, chinese character but no punctuation", you can find a comprehensive Chinese character regex for JavaScript built from the Unicode code point combinations used in Chinese, but it contains a couple of typos.
Here is a working snippet:
var r = /(?:[\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])+/g;
var t = '我的中文不好。我是意大利人。你知道吗?';
console.log(t.match(r));
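Where an ES6-capable engine is available, the range from the question can be written with Unicode code point escapes and the u flag instead of surrogate pairs. This is only a sketch under that assumption. The name `extB` is hypothetical, and the pattern covers just the U+20000 to U+2A6DF range from the question, not the full set of ranges in the snippet above.

```javascript
// With the `u` flag, \u{...} denotes a whole code point,
// so this class really is the range U+20000..U+2A6DF.
var extB = /[\u{20000}-\u{2A6DF}]/u;

console.log('sad'.match(extB));        // null: no Latin letter is in the range
console.log(extB.test('\u{20000}'));   // true: first code point of the range
console.log(extB.test('\u{2A6DF}'));   // true: last code point of the range
console.log(extB.test('\u{2A6E0}'));   // false: just past the end of the range
```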