Remove all spaces between Chinese words with regex

Question

I would like to remove the spaces between Chinese characters only, leaving all other spaces intact.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\&nbsp;/", "");

I have studied a similar question for Python, but it does not seem to work in my situation, so I am asking here.

Are your spaces actually `&nbsp;`, or did you just guess? — Justinas, Jan 14 '19 at 10:01
Using the latest ECMAScript 2018 regex syntax you may use `s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')` — Wiktor Stribiżew, Jan 14 '19 at 10:11
Info: the answers to this question also answer "*How to match Chinese characters in Javascript*". — user202729, Jan 15 '19 at 08:24

Wiktor Stribiżew · Accepted Answer · 2019-01-15T09:11:19.260

Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class that matches any Chinese char can be translated into

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]

In ES6, to match a single Chinese char, it can be used as

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u

Transpiling it to ES5 using ES2015 Unicode regular expression transpiler, we get

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])

pattern to match any Chinese char using JS RegExp.
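
The extra alternatives in the ES5 form exist because characters above U+FFFF are stored as two UTF-16 code units (a surrogate pair), and a regex without the u flag matches one code unit at a time. A minimal sketch of that, using only the first supplementary range from the patterns above; the variable names are illustrative and not part of the answer:

```javascript
// U+20000 is the first code point of the \u{20000}-\u{2A6D6} range.
const ch = '\u{20000}';
const units = [ch.charCodeAt(0).toString(16), ch.charCodeAt(1).toString(16)];
console.log(ch.length, units.join(' ')); // 2 d840 dc00

// ES6 form: with the u flag the character class matches whole code points.
const es6 = /^[\u{20000}-\u{2A6D6}]$/u;
// ES5 form of the same range: a high surrogate followed by a low surrogate.
const es5 = /^(?:[\uD840-\uD868][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6])$/;
console.log(es6.test(ch), es5.test(ch)); // true true
```

Both expressions accept the same code points; the ES5 one simply spells each range out as pairs of code units.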

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')

See the regex demo.

If your JS environment is ECMAScript 2018 compliant you may use a shorter

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')

Pattern details

(CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
\s+ - one or more whitespace characters (any Unicode whitespace)
(?=CHINESE_CHAR_PATTERN) - there must be a Chinese char immediately to the right of the current location.

JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]"; 
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));

A test for the regex compliant with the ECMAScript 2018 standard:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));
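
Because not every engine accepts \p{...}, one option is to try the ES2018 pattern and fall back to an explicit range when the RegExp constructor rejects it. The following is only a sketch under assumptions: buildHanSpaceRegex and removeHanSpaces are hypothetical helper names, and the fallback uses just the BMP part of the class quoted above, so supplementary-plane characters are not covered on older engines.

```javascript
// BMP-only part of the Han class quoted in the answer (supplementary planes omitted for brevity).
const HAN_BMP = '[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029' +
  '\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]';

// Hypothetical helper: prefer \p{Script=Hani}, fall back when the engine rejects it.
function buildHanSpaceRegex() {
  try {
    // The RegExp constructor throws a SyntaxError if property escapes are unsupported.
    return new RegExp('(\\p{Script=Hani})\\s+(?=\\p{Script=Hani})', 'gu');
  } catch (e) {
    return new RegExp('(' + HAN_BMP + ')\\s+(?=' + HAN_BMP + ')', 'g');
  }
}

// Hypothetical helper: drop whitespace that sits between two Chinese characters.
function removeHanSpaces(s) {
  return s.replace(buildHanSpaceRegex(), '$1');
}

const sample = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
console.log(removeHanSpaces(sample)); // 請把這裡的 10 多個字合併. Can you help me?
```

Building the pattern from a string keeps the \p{...} syntax out of regex literals, so the script still parses on engines that would otherwise fail with a SyntaxError at load time.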

FYI: if only one whitespace is expected between Chinese chars, remove `+` after `\s`. — Wiktor Stribiżew, Jan 14 '19 at 10:33
I get " { "message": "SyntaxError: invalid identity escape in regular expression", "filename": "https://stacksnippets.net/js", "lineno": 17, "colno": 22 }" When I run the snippet. (Using Firefox 62) — Pac0, Jan 14 '19 at 19:55
@Pac0 firefox has problems with "new" regexp e.g. [here](https://bugzilla.mozilla.org/show_bug.cgi?id=1361876) — Kamil Kiełczewski, Jan 14 '19 at 20:11
@Pac0 That is because of `/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu`, FF does not support ECMAScript 2018 Unicode property classes. Chrome does. — Wiktor Stribiżew, Jan 14 '19 at 20:46
Thanks Wiktor, I have compared the answers, and this seems to be the most detailed one that works for my question. — lewishole, Jan 15 '19 at 02:36
Problem with this is more Chinese characters will be added and this will end up not matching all of them. It might be a better idea to detect ES 2018 support and post the string for server-side processing otherwise. — billc.cn, Jan 18 '19 at 17:31
@billc.cn Or, just follow the above process to update the regex. — Wiktor Stribiżew, Jan 18 '19 at 18:18

Orelsanpls · Answer 2 · 2019-01-14T17:24:09.203

Using @Brett Zamir's solution for matching Chinese characters with a regex, from:

Javascript unicode string, chinese character but no punctuation

const str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';

const regex = new RegExp('([\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]) ([\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])* ', 'g');

const ret = str.replace(regex, '$1$2');

console.log(ret);

The pattern has the form:

([foo chinese chars]) ([foo chinese chars])*
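
As a rough illustration of how this consuming form behaves compared with a lookahead, here is a sketch that assumes a shortened character class ([\u4E00-\u9FCC] only) instead of the full one, with a test string taken from the comments; the variable names are made up for the example:

```javascript
// Shortened stand-in for the full Chinese character class (an assumption for brevity).
const han = '[\\u4E00-\\u9FCC]';
const input = '請 的 10 多 個 a';

// Consuming form: the trailing space is part of the match, so it is removed as well.
const consuming = new RegExp('(' + han + ') (' + han + ')* ', 'g');
console.log(input.replace(consuming, '$1$2')); // 請的10 多個a

// Lookahead form: only a space that is directly followed by a Chinese character is removed.
const lookahead = new RegExp('(' + han + ') (?=' + han + ')', 'g');
console.log(input.replace(lookahead, '$1')); // 請的 10 多個 a
```

The lookahead does not consume the following character, so every space is tested independently.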

edited Jan 14 '19 at 17:24

answered Jan 14 '19 at 10:09

The output here doesn't match with the ideal output. Notice the space in front of the 10. – holydragon Jan 14 '19 at 10:10
You lose the space before the 10 in the middle of the Chinese text, but you still found the right way to select Chinese characters :p – jonatjano Jan 14 '19 at 10:11
I'd use `\s+` instead of `' '` – yunzen Jan 14 '19 at 10:16
Thanks NEUT!! This answer is close to what I need, but some text is still not working. How can I fix it? Example (1): the text "最新消息" is changed to "最消息". Example (2): "最新消息" is left unchanged. – lewishole Jan 14 '19 at 11:13
@GrégoryNEUT `blabla` isn't a common [metasyntactic variable](https://en.wikipedia.org/wiki/Metasyntactic_variable) in English, you might want to use `foo` instead ;) – Aaron Jan 14 '19 at 17:14
This answer does not work when there are an even number of Chinese characters before the other text, such as the case that @bobblebubble mentions: `請的 10 多個 a`. It will remove too many spaces in that case. – fishinear Jan 14 '19 at 18:54
You should *really* break this down into simpler sub expressions. Some poor son of a bitch is going to have to debug that in 8 years. – Alexander Jan 15 '19 at 02:15

Pushpesh Kumar Rajwanshi · Answer 3 · 2019-01-14T10:36:28.177

The range for Chinese characters can be written as [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]. The regex below matches one or more Chinese characters followed by whitespace, and the lookahead (?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+) ensures that another Chinese character comes next:

([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)

Replace each match with $1.

Demo

var str = '請 把把把把把 這 裡裡裡裡裡 的 10 多多多多 個 字 合 併. Can you help me?';
console.log(str.replace(/([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)/g, "$1"));

Kamil Kiełczewski · Answer 4 · 2019-01-15T17:04:49.127

Try this

str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

This solution works with ASCII characters and Chinese characters in the range \u4E00-\u9FCC (I got the range from here; it contains about 20,000 characters, which is enough for daily usage but does not cover all Chinese characters).

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

console.log(str);

edited Jan 15 '19 at 17:04

answered Jan 14 '19 at 10:12

The space in front of the 10 is missing. – holydragon Jan 14 '19 at 10:13
@holydragon it's fixed now – Kamil Kiełczewski Jan 14 '19 at 10:28
@KamilKiełczewski No it isn't. It will still remove the space between another character and a Chinese character, not only spaces between two Chinese characters. And as the other answers (and your own link) show, the range you give does not include all Chinese characters. – fishinear Jan 15 '19 at 16:43
@fishinear can you show first problem by example (test-case) - because I don't understand – Kamil Kiełczewski Jan 15 '19 at 16:48
@Kamil It will remove the space after `London` in `的 London 多` (this is just an illustration - I don't speak Chinese, so this is probably not a valid Chinese sentence). – fishinear Jan 15 '19 at 16:52
@KamilKiełczewski `[ -~]` is a very limited range. What about other characters; non-English roman characters, Indian, Arabic, smileys, etc, etc? If you change the second character set to the negative of the Chinese, then it may work, but I think it is clearer to use a look-ahead, like other answers do. – fishinear Jan 15 '19 at 17:11

Younes Zaidi · Answer 5 · 2019-01-16T17:30:24.910

Another solution uses the match() method with the Chinese character range /[\u3400-\u9FBF]/ (more details):

str.match(/[\u3400-\u9FBF]/) // detects whether the string contains a Chinese character

My script to remove the spaces between Chinese characters:

var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var han = /[\u3400-\u9FBF]/; // approximate range for Chinese characters
// split the text on whitespace
var spl = chine.trim().split(/\s+/);  // spl = ["請","把","這","裡","的","10","多","個", ...]
var result = '';
for (var i = 0; i < spl.length; i++) {
  result += spl[i];
  // keep the space unless both this token and the next one contain a Chinese character;
  // the bounds check avoids reading past the last token
  if (i + 1 < spl.length && !(han.test(spl[i]) && han.test(spl[i + 1])))
    result += ' ';
}
console.log(result);

The same logic using map() instead of a for loop:

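var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var han = /[\u3400-\u9FBF]/; // approximate range for Chinese characters
var result = chine.trim().split(/\s+/).map(function (item, i, elm) {
  var isLast = i + 1 === elm.length;
  // no separator after the last token or between two tokens that both contain Chinese characters
  if (isLast || (han.test(item) && han.test(elm[i + 1])))
    return item;
  return item + ' ';
}).join('');
console.log(result);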

A block of code with no explanation and negligible comments does not make an ideal answer. — Rich, Jan 14 '19 at 19:34

score 0 · Answer 6 · edited Jan 14 '19 at 12:40

This might be useful in your scenario. (?<![ -~]) (?![ -~])
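
One way this pattern could be applied is sketched below; the variable names are illustrative. It removes a space only when neither neighbour is a printable ASCII character, which means it treats every non-ASCII character as if it were Chinese, and the lookbehind needs an ES2018-capable engine.

```javascript
// A space with no printable ASCII character ([ -~]) directly before or after it.
const nonAsciiGap = /(?<![ -~]) (?![ -~])/g;

const sample = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
const joined = sample.replace(nonAsciiGap, '');
console.log(joined); // 請把這裡的 10 多個字合併. Can you help me?

// Caveat: any non-ASCII neighbour counts, not only Chinese characters.
const mixed = '\u00E9 多'.replace(nonAsciiGap, '');
console.log(mixed); // é多
```

For text that mixes Chinese with other non-ASCII scripts, a pattern that names the Chinese ranges explicitly is the safer choice.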

edited Jan 14 '19 at 12:40

Sebastian Hofmann

answered Jan 14 '19 at 12:10

Shantanu Patwardhan

You need to explain what that does and why it is useful in this particular situation, and how it can be used to solve the OP's question. – fishinear Jan 15 '19 at 16:45
