2

I have a regular expression /\s*,\s*/ that matches optional whitespace, then a comma, then optional whitespace.

Example:

var str = "john,walker    james  , paul";
var arr = str.split(/\s*,\s*/);
Values in arr = ["john", "walker    james", "paul"] // Size: 3

Example with Chinese characters:

var str = "继续，取消   继续 ，取消";
var arr = str.split(/\s*,\s*/);
Values in arr = ["继续，取消   继续 ，取消"] // Size: 1; the whole string is at index 0, no splitting happened

I also tried writing the space and comma as Unicode escapes:

var str = "john,walker    james  , paul";
var arr = str.split(/\u0020*\u002C\u0020*/);
Values in arr = ["john", "walker    james", "paul"] // Size: 3

var str = "继续，取消   继续 ，取消";
var arr = str.split(/\u0020*\u002C\u0020*/);
Values in arr = ["继续，取消   继续 ，取消"] // Size: 1; the whole string is at index 0, no splitting happened

I went through this link, but it did not have much information I could use in my scenario. Is it really impossible to write a regex that handles Chinese text and splits it?

Shawn
quintin

4 Answers

7

As of 2018, JavaScript regular expressions support Unicode property escapes (they require the u flag), so to match Chinese characters you can do this:

const REGEX = /(\p{Script=Hani})+/gu;
'你好'.match(REGEX);
// ["你好"]

The trick is to use \p{...} with the right script name; Hani is the alias for the Han script (Chinese characters). The full list of scripts is here: http://unicode.org/Public/UNIDATA/PropertyValueAliases.txt
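As a rough sketch of how this applies to the string from the question, the same property escape can pull out the runs of Han characters directly instead of splitting on a separator. The names `HAN_RUN`, `input`, and `words` are made up for illustration, and the snippet assumes a runtime that supports ES2018 Unicode property escapes.

```javascript
// Assumes an engine with ES2018 Unicode property escapes (the `u` flag is required).
const HAN_RUN = /\p{Script=Han}+/gu;

// The question's string, with its fullwidth commas written as \uFF0C.
const input = "继续\uFF0C取消   继续 \uFF0C取消";

// Each match is one unbroken run of Han characters; commas and spaces are skipped.
const words = input.match(HAN_RUN);
console.log(words);
```

Note that this breaks at whitespace as well as at commas, so it yields four runs here rather than the three comma-separated fields.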

2

An ASCII comma won't match the comma you have in the Chinese text, which is the fullwidth comma. Either replace the ASCII comma (\x2C) with the fullwidth one (\uFF0C), or use a character class [,，] to match both:

var str = "继续，取消   继续 ，取消";
console.log(str.split(/\s*[,，]\s*/));
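Because the two commas look nearly identical on screen, it can help to print their code points before adjusting the pattern. The following is a minimal sketch; `toCodePoint` is a hypothetical helper written for this example, not a built-in.

```javascript
// Hypothetical helper: format the first character of a string as U+XXXX.
const toCodePoint = (ch) =>
  "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0");

const asciiComma = ",";
const fullwidthComma = "\uFF0C";

console.log(toCodePoint(asciiComma));     // U+002C
console.log(toCodePoint(fullwidthComma)); // U+FF0C

// Splitting on either comma, with the fullwidth one written as an escape
// so it cannot be confused with the ASCII comma in the source.
const sample = "继续\uFF0C取消   继续 \uFF0C取消";
const parts = sample.split(/\s*[,\uFF0C]\s*/);
console.log(parts.length); // 3
```

Writing the fullwidth comma as \uFF0C in the pattern is a style choice; the literal character works just as well if the file encoding is reliable.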

Here is a regex that will match all the commas mentioned on the Comma Wikipedia page:

/\s*(?:\uD805\uDC4D|\uD836\uDE87|[\u002C\u02BB\u060C\u2E32\u2E34\u2E41\u2E49\u3001\uFE10\uFE11\uFE50\uFE51\uFF0C\uFF64\u00B7\u055D\u07F8\u1363\u1802\u1808\uA4FE\uA60D\uA6F5\u02BD\u0312\u0313\u0314\u0315\u0326\u201A])\s*/

Note that U+1144D (NEWA COMMA) and U+1DA87 (SIGNWRITING COMMA) have to be transpiled as \uD805\uDC4D and \uD836\uDE87 in order to be compatible with the ES5 regex standard.
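If ES5 compatibility is not required, the surrogate pairs can be avoided. The sketch below compares the two spellings on a deliberately shortened comma list (only U+002C, U+FF0C and the two astral commas); the variable names are illustrative, and the second pattern assumes an engine that supports the ES2015 u flag.

```javascript
// ES5-compatible: astral commas spelled as surrogate pairs in an alternation.
const es5Style = /\s*(?:\uD805\uDC4D|\uD836\uDE87|[\u002C\uFF0C])\s*/;

// ES2015+: with the `u` flag, \u{...} code point escapes can sit inside the class.
const es2015Style = /\s*[\u002C\uFF0C\u{1144D}\u{1DA87}]\s*/u;

// Build a test string around U+1144D (NEWA COMMA).
const newaComma = String.fromCodePoint(0x1144d);
const text = "a " + newaComma + " b";

const es5Parts = text.split(es5Style);
const es2015Parts = text.split(es2015Style);
console.log(es5Parts, es2015Parts);
```

Both patterns split the sample into "a" and "b"; the full comma list from the regex above could be carried over to the u-flag form in the same way.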

The following commas are handled: [image of the comma list, not shown]

Wiktor Stribiżew
  • The value of str can be in any language; will this solution work with any locale? – quintin Jun 21 '17 at 07:15
  • @quintin: The `\s` matches any Unicode whitespace in JS regex. As for the commas, you need to create a character class including all commas there are in the Unicode table if you need to support all Unicode commas. There is no special class for commas, and moreover, JS regex just does not even support Unicode category classes (those `\p{...}` ones). According to [this site](https://www.compart.com/en/unicode/based/U+002C) there are 3 Unicode commas based on the ASCII one: `[︐﹐，,]`. Also, see the [comma Wiki page](https://en.wikipedia.org/wiki/Comma) for more comma codes. – Wiktor Stribiżew Jun 21 '17 at 07:17
  • Thanks @Wiktor this is very helpful – quintin Jun 21 '17 at 08:01
1

I did it and it works!

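// str: the string to test
var re1 = /^[\u4E00-\uFA29]+$/; // broad range that starts at the CJK Unified Ideographs block
var re2 = /^[\uE7C7-\uE7F3]+$/; // private-use code points; already inside re1's range
str = str.trim(); // strip leading and trailing whitespace
if (re1.test(str) || re2.test(str)) { // + instead of * so an empty string is not reported
  console.log('CHINESE CHAR');
}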
Zvi
0

Just use vanilla JavaScript:

const str = "继续，取消   继续 ，取消";

// Replace every fullwidth (Chinese) comma with an ASCII comma, then split on it
const arr = str.replace(/，/g, `,`).split(`,`);


console.log(`result arr`, arr);
xgqfrms