How do I remove all Unicode from a string, but keep languages such as Japanese, Greek, Hindi, etc.?

Question

How would I remove all Unicode from this string: `【Hello!】★ ああああ`? I need to remove all the "weird" symbols (【, ★, 】) and keep "Hello!" and "ああああ". This needs to work for all languages, not just Japanese.

Are the weird symbols `【`, `★`, and `】` in this case? Do we need to consider other symbols? – kuromoka, Sep 30 '18 at 03:20

Davislor · Answer 1 · 2018-09-30T03:51:01.220

score 1

You want to remove characters in the Unicode general categories Other Symbol, Modifier Symbol, and Enclosing Mark, but leave those from other categories.

Using regular expressions, those match the classes `\p{So}`, `\p{Sk}` and `\p{Me}`, respectively. You might, for example, use `XRegExp.replace()`.
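A minimal sketch of the category-based approach follows. It assumes a JavaScript engine with ES2018 Unicode property escapes (the `u` flag) and uses them in place of XRegExp; `stripSymbols` and the two pattern constants are hypothetical names chosen for illustration. One caveat: `【` and `】` belong to the Open Punctuation (`Ps`) and Close Punctuation (`Pe`) categories, so the sketch also shows a wider pattern that includes them.

```javascript
// Sketch only: native ES2018 property escapes instead of XRegExp.
// `stripSymbols` is a hypothetical helper name.

// So = Other Symbol, Sk = Modifier Symbol, Me = Enclosing Mark
const SYMBOLS = /[\p{So}\p{Sk}\p{Me}]/gu;

// Assumption: brackets should go too. Ps/Pe are Open/Close Punctuation,
// so this pattern also strips ASCII ( ) [ ] { }.
const SYMBOLS_AND_BRACKETS = /[\p{So}\p{Sk}\p{Me}\p{Ps}\p{Pe}]/gu;

function stripSymbols(text, pattern = SYMBOLS) {
  // Every matching character is deleted; letters, digits, spaces
  // and other punctuation such as "!" are left untouched.
  return text.replace(pattern, '');
}

const input = '【Hello!】★ ああああ';
const symbolsOnly = stripSymbols(input);
const withBrackets = stripSymbols(input, SYMBOLS_AND_BRACKETS);

console.log(symbolsOnly);  // 【Hello!】 ああああ
console.log(withBrackets); // Hello! ああああ
```

Which categories to strip is a judgment call: the wider pattern gives the output asked for in the question, at the cost of also removing ordinary parentheses and brackets.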

edited Sep 30 '18 at 03:51

answered Sep 30 '18 at 03:21

Davislor

PHP has a regular expression class that looks like `\p{common}`, which would work, *but* that is PHP and I need JavaScript. The same goes for yours. – BurstingKitten Sep 30 '18 at 03:41
There are [regex libraries for JavaScript that support categories.](https://regular-expressions.mobi/xregexp.html) – Davislor Sep 30 '18 at 03:48

score -1 · Answer 2 · answered Sep 30 '18 at 04:13

I have found a solution. Using XRegExp, I was able to use PHP's `\p{Common}` in Node.

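const XRegExp = require('xregexp');

// Sample input (note the extra trailing 】)
const str = '【Hello!】★ ああああ】';
// \p{Common} matches the Unicode script "Common": punctuation, symbols, spaces
const regex = XRegExp('\\p{Common}', 'g');
// Replace every match with a single space
const res = XRegExp.replace(str, regex, ' ');

console.log(res); // " Hello    ああああ " (the "!" is Common too, so it is replaced)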
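For comparison, the same script-based match can be written without XRegExp. This is only a sketch, assuming an engine with ES2018 Unicode property escapes (recent Node.js versions have them); `replaceCommon` is a hypothetical helper name. The native syntax needs the explicit `Script=` prefix, since a bare `\p{Common}` is not accepted by built-in JavaScript regular expressions.

```javascript
// Sketch only: native counterpart of the XRegExp call above.
// `replaceCommon` is a hypothetical name, not part of any library.
function replaceCommon(text, replacement = ' ') {
  // Script=Common covers characters shared across scripts:
  // punctuation, symbols, spaces, digits.
  return text.replace(/\p{Script=Common}/gu, replacement);
}

const replaced = replaceCommon('【Hello!】★ ああああ】');
console.log(JSON.stringify(replaced)); // quoting makes the padding visible
console.log(replaced.trim());
```

Because `!` and the space character are also in the Common script, they are replaced along with the brackets and the star, so "Hello!" loses its exclamation mark.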

BurstingKitten

Side note: PHP borrowed that syntax from PCRE, which stands for Perl-Compatible Regular Expressions. (Its syntax is not really Perl-compatible.) – Davislor Sep 30 '18 at 09:21
