42

I would like to remove all spaces among Chinese text only.

My text: "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?"

Ideal output: "請把這裡的 10 多個字合併. Can you help me?"

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace("/\ /", "");

I have studied a similar question for Python but it seems not to work in my situation so I brought my question here for some help.

Boann
lewishole

6 Answers

33

Getting to the Chinese char matching pattern

Using the Unicode Tools, the \p{Han} Unicode property class that matches any Chinese char can be translated into

[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\U00020000-\U0002A6D6\U0002A700-\U0002B734\U0002B740-\U0002B81D\U0002B820-\U0002CEA1\U0002CEB0-\U0002EBE0\U0002F800-\U0002FA1D]

In ES6, to match a single Chinese char, it can be used as

/[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9\u{20000}-\u{2A6D6}\u{2A700}-\u{2B734}\u{2B740}-\u{2B81D}\u{2B820}-\u{2CEA1}\u{2CEB0}-\u{2EBE0}\u{2F800}-\u{2FA1D}]/u

Transpiling it to ES5 using ES2015 Unicode regular expression transpiler, we get

(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])

pattern to match any Chinese char using JS RegExp.

So, you may use

s.replace(/([\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D])\s+(?=(?:[\u2E80-\u2E99\u2E9B-\u2EF3\u2F00-\u2FD5\u3005\u3007\u3021-\u3029\u3038-\u303B\u3400-\u4DB5\u4E00-\u9FEF\uF900-\uFA6D\uFA70-\uFAD9]|[\uD840-\uD868\uD86A-\uD86C\uD86F-\uD872\uD874-\uD879][\uDC00-\uDFFF]|\uD869[\uDC00-\uDED6\uDF00-\uDFFF]|\uD86D[\uDC00-\uDF34\uDF40-\uDFFF]|\uD86E[\uDC00-\uDC1D\uDC20-\uDFFF]|\uD873[\uDC00-\uDEA1\uDEB0-\uDFFF]|\uD87A[\uDC00-\uDFE0]|\uD87E[\uDC00-\uDE1D]))/g, '$1')

See the regex demo.

If your JS environment is ECMAScript 2018 compliant you may use a shorter

s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1')

Pattern details

  • (CHINESE_CHAR_PATTERN) - Capturing group 1 ($1 in the replacement pattern): any Chinese char
  • \s+ - one or more whitespace characters (any Unicode whitespace)
  • (?=CHINESE_CHAR_PATTERN) - there must be a Chinese char immediately to the right of the current location.

JS demo:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
var HanChr = "[\\u2E80-\\u2E99\\u2E9B-\\u2EF3\\u2F00-\\u2FD5\\u3005\\u3007\\u3021-\\u3029\\u3038-\\u303B\\u3400-\\u4DB5\\u4E00-\\u9FEF\\uF900-\\uFA6D\\uFA70-\\uFAD9]|[\\uD840-\\uD868\\uD86A-\\uD86C\\uD86F-\\uD872\\uD874-\\uD879][\\uDC00-\\uDFFF]|\\uD869[\\uDC00-\\uDED6\\uDF00-\\uDFFF]|\\uD86D[\\uDC00-\\uDF34\\uDF40-\\uDFFF]|\\uD86E[\\uDC00-\\uDC1D\\uDC20-\\uDFFF]|\\uD873[\\uDC00-\\uDEA1\\uDEB0-\\uDFFF]|\\uD87A[\\uDC00-\\uDFE0]|\\uD87E[\\uDC00-\\uDE1D]"; 
console.log(s.replace(new RegExp('(' + HanChr + ')\\s+(?=(?:' + HanChr + '))', 'g'), '$1'));

A test for the regex compliant with the ECMAScript 2018 standard:

var s = "請 把 這 裡 的 10 多 個 字 合 併. Can you help me?";
console.log(s.replace(/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu, '$1'));
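
Because some engines reject `\p{…}` with a SyntaxError (as reported in the comments below), one option is to detect support at runtime and fall back to an explicit range. The following is only a minimal sketch: the helper names `buildHanSpaceRegex` and `removeHanSpaces` are hypothetical, and the fallback range is deliberately abbreviated to two BMP blocks (U+3400-U+4DBF and U+4E00-U+9FFF), so it covers fewer characters than the full pattern above.

```javascript
// Hypothetical helper: build the "whitespace between two Han chars" regex,
// preferring the ES2018 Unicode property escape when the engine supports it.
function buildHanSpaceRegex() {
  try {
    // The RegExp constructor throws at runtime on engines without \p{...},
    // whereas a regex literal would make the whole script fail to parse.
    return new RegExp('(\\p{Script=Hani})\\s+(?=\\p{Script=Hani})', 'gu');
  } catch (e) {
    // Abbreviated fallback (assumption): two BMP blocks only, not every Han char.
    var han = '[\\u3400-\\u4DBF\\u4E00-\\u9FFF]';
    return new RegExp('(' + han + ')\\s+(?=' + han + ')', 'g');
  }
}

// Hypothetical helper: remove whitespace that sits between two Han chars.
function removeHanSpaces(s) {
  return s.replace(buildHanSpaceRegex(), '$1');
}

console.log(removeHanSpaces('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// → "請把這裡的 10 多個字合併. Can you help me?"
```

Building the pattern from a string keeps the unsupported syntax out of the parser's way, which is why the sketch avoids a regex literal in the `try` block.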
Wiktor Stribiżew
  • FYI: if only one whitespace is expected between Chinese chars, remove `+` after `\s`. – Wiktor Stribiżew Jan 14 '19 at 10:33
  • I get " { "message": "SyntaxError: invalid identity escape in regular expression", "filename": "https://stacksnippets.net/js", "lineno": 17, "colno": 22 }" When I run the snippet. (Using Firefox 62) – Pac0 Jan 14 '19 at 19:55
  • @Pac0 firefox has problems with "new" regexp e.g. [here](https://bugzilla.mozilla.org/show_bug.cgi?id=1361876) – Kamil Kiełczewski Jan 14 '19 at 20:11
  • 3
    @Pac0 That is because of `/(\p{Script=Hani})\s+(?=\p{Script=Hani})/gu`, FF does not support ECMAScript 2018 Unicode property classes. Chrome does. – Wiktor Stribiżew Jan 14 '19 at 20:46
  • 1
    Thanks Wiktor, I have compared the answers. And this seems would be the most detailed and worked answer to my question. – lewishole Jan 15 '19 at 02:36
  • Problem with this is more Chinese characters will be added and this will end up not matching all of them. It might be a better idea to detect ES 2018 support and post the string for server-side processing otherwise. – billc.cn Jan 18 '19 at 17:31
  • @billc.cn Or, just follow the above process to update the regex. – Wiktor Stribiżew Jan 18 '19 at 18:18
22

Using @Brett Zamir's solution for matching Chinese characters in a regex:

Javascript unicode string, chinese character but no punctuation


const str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';

const regex = new RegExp('([\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d]) ([\u4E00-\u9FCC\u3400-\u4DB5\uFA0E\uFA0F\uFA11\uFA13\uFA14\uFA1F\uFA21\uFA23\uFA24\uFA27-\uFA29]|[\ud840-\ud868][\udc00-\udfff]|\ud869[\udc00-\uded6\udf00-\udfff]|[\ud86a-\ud86c][\udc00-\udfff]|\ud86d[\udc00-\udf34\udf40-\udfff]|\ud86e[\udc00-\udc1d])* ', 'g');

const ret = str.replace(regex, '$1$2');

console.log(ret);

The pattern has this shape:

([foo chinese chars]) ([foo chinese chars])*
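
As the comments below point out, a pattern that consumes the second Chinese character can skip spaces, because a consumed character cannot start the next match. The sketch below contrasts a consuming pattern with a lookahead variant. It is an illustration under assumptions, not the pattern from this answer: the names `HAN`, `consuming` and `lookahead` are hypothetical, and the range is abbreviated to two BMP blocks.

```javascript
// Abbreviated range for illustration only (assumption): two BMP Han blocks.
var HAN = '[\\u3400-\\u4DBF\\u4E00-\\u9FFF]';

// Consuming variant: the second char is part of the match, so it cannot
// start the next match and every other space is left in place.
var consuming = new RegExp('(' + HAN + ')\\s+(' + HAN + ')', 'g');

// Lookahead variant: the second char is only checked, not consumed,
// so it is still available as the first char of the next match.
var lookahead = new RegExp('(' + HAN + ')\\s+(?=' + HAN + ')', 'g');

var sample = '請 把 這 裡 的 10 多';
console.log(sample.replace(consuming, '$1$2')); // → "請把 這裡 的 10 多"
console.log(sample.replace(lookahead, '$1'));   // → "請把這裡的 10 多"
```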
Orelsanpls
  • 2
    The output here doesn't match with the ideal output. Notice the space in front of the 10. – holydragon Jan 14 '19 at 10:10
  • you lose the space before the 10 at the center of the chineses word but still you found the right way to select chinese characters :p – jonatjano Jan 14 '19 at 10:11
  • I'd use `\s+` instead of `' '` – yunzen Jan 14 '19 at 10:16
  • Thanks NEUT!! This answer is close to what I need, but some text still does not work. How can I fix it? Example (1): "最 新消 息" is changed to "最消息". Example (2): "最新消 息" is left unchanged. – lewishole Jan 14 '19 at 11:13
  • 2
    @GrégoryNEUT `blabla` isn't a common [metasyntactic variable](https://en.wikipedia.org/wiki/Metasyntactic_variable) in English, you might want to use `foo` instead ;) – Aaron Jan 14 '19 at 17:14
  • 2
    This answer does not work when there are an even number of Chinese characters before the other text, such as the case that @bobblebubble mentions: `請 的 10 多 個 a`. It will remove too many spaces in that case. – fishinear Jan 14 '19 at 18:54
  • 1
    You should *really* break this down into simpler sub expressions. Some poor son of a bitch is going to have to debug that in 8 years. – Alexander Jan 15 '19 at 02:15
11

The range for Chinese characters can be written as [\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC], so you can use the regex below. It captures one or more Chinese characters followed by whitespace, and the lookahead (?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+) ensures that another Chinese character follows:

([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)

Then replace each match with $1.

Demo

var str = '請 把把把把把 這 裡裡裡裡裡 的 10 多多多多 個 字 合 併. Can you help me?';
console.log(str.replace(/([\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)\s+(?=[\u2E80-\u2FD5\u3190-\u319f\u3400-\u4DBF\u4E00-\u9FCC]+)/g, "$1"));
Pushpesh Kumar Rajwanshi
4

Try this

str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

This solution works with ASCII characters and Chinese characters in the range \u4E00-\u9FCC (I took the range from here; it contains about 20,000 characters, enough for daily usage, but not all Chinese characters).

var str = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
str = str.replace(/ ([\u4E00-\u9FCC])|([ -~]+ )/g, '$1$2');

console.log(str);
Kamil Kiełczewski
  • 3
    The space in front of the 10 is missing. – holydragon Jan 14 '19 at 10:13
  • @holydragon it's fixed now – Kamil Kiełczewski Jan 14 '19 at 10:28
  • @KamilKiełczewski No it isn't. It will still remove the space between another character and a Chinese character, not only spaces between two Chinese characters. And as the other answers (and your own link) show, the range you give does not include all Chinese characters. – fishinear Jan 15 '19 at 16:43
  • @fishinear can you show first problem by example (test-case) - because I don't understand – Kamil Kiełczewski Jan 15 '19 at 16:48
  • @Kamil It will remove the space after `London` in `的 London 多` (this is just an illustration - I don't speak Chinese, so this is probably not a valid Chinese sentence). – fishinear Jan 15 '19 at 16:52
  • @KamilKiełczewski `[ -~]` is a very limited range. What about other characters; non-English roman characters, Indian, Arabic, smileys, etc, etc? If you change the second character set to the negative of the Chinese, then it may work, but I think it is clearer to use a look-ahead, like other answers do. – fishinear Jan 15 '19 at 17:11
0

Another solution uses the match() method with the Chinese character range /[\u3400-\u9FBF]/ (more details).

str.match(/[\u3400-\u9FBF]/) // non-null if str contains a Chinese character

My script to remove the spaces between Chinese characters:

var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var han = /[\u3400-\u9FBF]/;
// Split the text on whitespace
var spl = chine.trim().split(/\s+/);  // spl = ["請","把","這","裡","的","10","多","個", ...]
var result = '';
for (var i = 0; i < spl.length; i++) {
  var next = spl[i + 1]; // undefined for the last token
  // Drop the space only when both this token and the next one contain a Chinese character
  if (next !== undefined && han.test(spl[i]) && han.test(next)) {
    result += spl[i];
  } else if (next !== undefined) {
    result += spl[i] + ' '; // otherwise keep one space between the tokens
  } else {
    result += spl[i]; // last token: no trailing space, and no lookup past the end
  }
}
console.log(result);
  • Using the map() function instead of a for loop:

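var chine = '請 把 這 裡 的 10 多 個 字 合 併. Can you help me?';
var han = /[\u3400-\u9FBF]/;
var result = chine.trim().split(/\s+/).map(function (item, i, arr) {
  var next = arr[i + 1]; // undefined for the last token
  // No space after the last token or between two tokens that both contain a Chinese character
  var joinDirectly = next === undefined || (han.test(item) && han.test(next));
  return joinDirectly ? item : item + ' ';
}).join('');
console.log(result);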
Younes Zaidi
  • 1
    A block of code with no explanation and negligible comments does not make an ideal answer. – Rich Jan 14 '19 at 19:34
0

This might be useful in your scenario. (?<![ -~]) (?![ -~])
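
To make the intent of that pattern concrete: it matches a single space that is neither preceded nor followed by a printable ASCII character (U+0020 to U+007E), so replacing each match with an empty string joins the neighbouring non-ASCII characters. A minimal sketch, assuming an engine with lookbehind support (ES2018); the function name `removeNonAsciiGaps` is hypothetical.

```javascript
// Hypothetical helper: remove a single space only when neither neighbour
// is a printable ASCII character. Requires lookbehind support (ES2018).
function removeNonAsciiGaps(s) {
  return s.replace(/(?<![ -~]) (?![ -~])/g, '');
}

console.log(removeNonAsciiGaps('請 把 這 裡 的 10 多 個 字 合 併. Can you help me?'));
// → "請把這裡的 10 多個字合併. Can you help me?"
```

Note that the pattern treats every non-ASCII character as if it were Chinese, so it would also join accented Latin, Arabic or emoji neighbours, and it matches a space at the very start or end of the string when the other neighbour is non-ASCII.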

Sebastian Hofmann
  • 1
    You need to explain what that does and why it is useful in this particular situation, and how it can be used to solve the OP's question. – fishinear Jan 15 '19 at 16:45