replace/replaceAll with regex on unicode issues

Question

Is there a way to apply the replace method to Unicode text in general (Arabic is the concern here)? In the example below, replacing a whole word works nicely on the English text, but it fails to detect, and therefore replace, the Arabic word. I added the u flag to enable Unicode parsing, but that didn't help. In the Arabic example below, the word النجوم should be replaced and والنجوم should not, but that is not what happens.

<!DOCTYPE html>
<html>
<body>
<p>Click to replace...</p>
<button onclick="myFunction()">replace</button>
<p id="demo"></p>
<script>
function myFunction() {
  var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
  var rep = 'النجوم';
  var repWith = 'الليل';

  //var str = "the sun and the stars, then the starsz and the day";
  //var rep = 'stars';
  //var repWith = 'night';

  var result = str.replace(new RegExp("\\b"+rep+"\\b", "ug"), repWith);
  document.getElementById("demo").innerHTML = result;
}
</script>
</body>
</html>

Whatever solution you offer, please keep using variables as in the code above (such as the variable rep), because the words to replace are passed in through function calls.

UPDATE: To try the above code, replace code in here with the code above.

In JS, regex word boundaries are problematic with Unicode. Try to take out the `\b`. — , Dec 24 '17 at 19:00
@sin. Oh I tried, no good. I posted a link where you could try that yourself. — mohsenmadi, Dec 24 '17 at 19:12
If you need to work with Unicode, I think you should consider using XRegExp library. See [this JSFiddle](https://jsfiddle.net/m6gvrj21/1/). The result I got is `الشمس والقمر والنجوم، ثم الليل والنهار` — Wiktor Stribiżew, Dec 24 '17 at 19:21
This is a dirty solution, but how about something like: `/(^|[^a-zA-ZΆΈ-ώἀ-ῼ\n])النجوم(?![a-zA-ZΆΈ-ώἀ-ῼ])/` -- Inspired by [this answer](https://stackoverflow.com/a/23458918/1954610), I'm explicitly looking for unicode character ranges, rather than relying on word boundaries as JavaScript doesn't support them in the context of unicode. — Tom Lord, Dec 24 '17 at 19:25
...However, since you actually just need to look for **Arabic** characters, you should refine that regex to only include the chars you need. A [quick google search](https://stackoverflow.com/a/29729405/1954610) reveals `[\u0621-\u064A\u0660-\u0669 ]` may work? Not fully tested/researched, though... — Tom Lord, Dec 24 '17 at 19:30
Note that the `'u'` flag is for ECMAScript 6. All it does is recognize the Unicode constructs like `\u2092` etc. However, I don't think things like word boundaries are Unicode aware. See https://mathiasbynens.be/notes/es6-unicode-regex — , Dec 24 '17 at 19:41
It's a nasty regex to simulate word boundaries in JS via the Unicode route. Especially since JS is so lame it won't do lookbehind assertions, so the first character has to be matched, paired with a lookahead right after it... just nasty. I've done it before, it covers all langs, but you wouldn't like it. — , Dec 24 '17 at 19:44
A better approach would be to simulate whitespace boundaries instead. — , Dec 24 '17 at 19:46
Thank you all for your answers. I am in and out for now but will test all your suggestions. For the last one, I thought about it @sin, but white space is not good if the word is at $ or ^. A better one is a test with `indexOf`, and a comparison on the length for the matches to guarantee a whole match. — mohsenmadi, Dec 24 '17 at 22:13
`but white space is not good if the word is at $ or ^` A negative of a negative is a positive. This applies to a negative class inside a negative assertion. The anchors `^$` always match a negative class since anchors cannot exist in a class. Therefore, `(?![^anything])` will always match the absolute end of string and `(?<![^anything])` the beginning. — , Dec 26 '17 at 16:33
Note this shorthand for a word boundary `(?:(?:^|(?<=\W))(?=\w)|(?<=\w)(?:$|(?=\W)))`. In JS, this will translate into a very complex substitution using the `\uDDDD` notation since JS only knows UTF-16 notation, and does not know what a Unicode word boundary is. To think this will cover all of Unicode is unrealistic. I do have the regex that covers all of Unicode, but it is hairy. If it's something you have to have, let me know. And beware, this `[\pL0-9_]` does _not_ represent all of Unicode word characters, it ignores about 3,000 valid words. — , Dec 26 '17 at 16:50
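
To make the point from the comments concrete, here is a small sketch showing that `\w` and `\b` stay ASCII-only in JavaScript even with the `u` flag, which is why the pattern in the question never matches the Arabic word. The sample strings come from the question; the variable names are only illustrative.

```javascript
// Sample strings taken from the question.
const arabic = "الشمس والقمر والنجوم، ثم النجوم والنهار";
const english = "the sun and the stars, then the starsz and the day";

// With these flags \w means [A-Za-z0-9_], so an Arabic letter is not a word character.
const arabicLetterIsWordChar = /\w/u.test("ن");

// \b needs a word character on exactly one side of the position.
// Around an Arabic word both neighbours are non-word characters, so \b never matches.
const arabicMatches = new RegExp("\\b" + "النجوم" + "\\b", "u").test(arabic);
const englishMatches = new RegExp("\\b" + "stars" + "\\b", "u").test(english);

console.log(arabicLetterIsWordChar, arabicMatches, englishMatches); // false false true
```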

score 3 · Accepted Answer · edited Dec 29 '17 at 05:26

A \bword\b pattern can be represented as (^|[^A-Za-z0-9_])word(?![A-Za-z0-9_]). Because the leading group consumes the character before the word, you need to add $1 before the replacement text when you replace the match.
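
A minimal sketch of this emulation for plain ASCII text, using the native `RegExp`. The function name `replaceWholeWordAscii` is hypothetical, and escaping the search word is an addition made here so that a word containing regex metacharacters is matched literally.

```javascript
// Emulates \bword\b with a consuming leading group and a trailing negative lookahead.
function replaceWholeWordAscii(str, word, replacement) {
  // Escape regex metacharacters so the word is matched literally.
  const escaped = word.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  const re = new RegExp('(^|[^A-Za-z0-9_])' + escaped + '(?![A-Za-z0-9_])', 'g');
  // "lead" is capture group 1 (start of string or the preceding non-word character);
  // it has to be put back, which is the role of $1 in a string replacement.
  return str.replace(re, (match, lead) => lead + replacement);
}

const sample = "the sun and the stars, then the starsz and the day";
console.log(replaceWholeWordAscii(sample, "stars", "night"));
// → the sun and the night, then the starsz and the day
```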

Since you need to work with Unicode, it makes sense to use the XRegExp library, which supports a "shorthand" \pL notation for any base Unicode letter. You may replace A-Za-z in the above pattern with \pL:

var str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
var rep = 'النجوم';
var repWith = 'الليل';

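// XRegExp.escape keeps any regex metacharacters in rep from being interpreted as pattern syntax
var regex = new XRegExp('(^|[^\\pL0-9_])' + XRegExp.escape(rep) + '(?![\\pL0-9_])');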
var result = XRegExp.replace(str, regex, '$1' + repWith, 'all');
console.log(result);

<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.1.1/xregexp-all.min.js"></script>
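
As an alternative sketch that avoids the library: engines implementing ES2018 accept Unicode property escapes such as `\p{L}` when the `u` flag is set, so the same pattern can be built with the native `RegExp`. This is a different technique from the XRegExp call above, not a description of it. The helper names `escapeRegExp` and `replaceWholeWord` are made up for this example, and treating `\p{M}` (combining marks, such as Arabic diacritics) and `\p{N}` (numbers) as word characters is an assumption you may want to adjust.

```javascript
// Escape regex syntax characters so the search word is matched literally.
function escapeRegExp(s) {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Whole-word replace using Unicode property escapes (requires the "u" flag).
function replaceWholeWord(str, rep, repWith) {
  // Assumed definition of a "word character": letters, marks, numbers, underscore.
  const wordChar = '\\p{L}\\p{M}\\p{N}_';
  const regex = new RegExp(
    '(^|[^' + wordChar + '])' + escapeRegExp(rep) + '(?![' + wordChar + '])',
    'gu'
  );
  // Put the consumed leading character (group 1) back in front of the replacement.
  return str.replace(regex, (match, lead) => lead + repWith);
}

const str = "الشمس والقمر والنجوم، ثم النجوم والنهار";
console.log(replaceWholeWord(str, 'النجوم', 'الليل'));
// → الشمس والقمر والنجوم، ثم الليل والنهار
```

The `g` flag takes the place of the `'all'` argument passed to `XRegExp.replace`.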

UPDATE by @mohsenmadi: To integrate in an Angular app, follow these steps:

1. Issue an npm install xregexp to add the library to package.json
2. Inside a component, add an import { replace, build } from 'xregexp/xregexp-all.js';
3. Build the regex with: let regex = build('(^|[^\\pL0-9_])' + rep + '(?![\\pL0-9_])');
4. Replace with: let result = replace(str, regex, '$1' + repWith, 'all');

Many thanks for this solution! I didn't know about XRegExp. I just tried it and it works. I also wanted to try a "replaceAll" operation, and all that needs to be done is to add the argument `'all'` to the `XRegExp.replace()` call as in http://xregexp.com/api/#replace. I need to integrate this solution into an Angular app - I hope this goes smoothly. I will accept as an answer after some more research. Thank you. — mohsenmadi, Dec 24 '17 at 22:31

score 2 · Answer 2 · 2017-12-26T17:09:59.633

In case you change your mind about whitespace boundaries, here is the regex.

var Rx = new RegExp(
   "(^|[\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
   + text +
   "(?![^\\u0009-\\u000D\\u0020\\u0085\\u00A0\\u1680\\u2000-\\u200A\\u2028-\\u2029\\u202F\\u205F\\u3000])"
   ,"ug");

var result = str.replace( Rx, '$1' + repWith );

Regex explanation

 (                             # (1 start), simulated whitespace boundary
      ^                             # BOL
   |                              # or whitespace
      [\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000] 
 )                             # (1 end)

 text                          # To find

 (?!                           # Whitespace boundary
      [^\u0009-\u000D\u0020\u0085\u00A0\u1680\u2000-\u200A\u2028-\u2029\u202F\u205F\u3000] 
 )

In an engine that can use lookbehind assertions, a whitespace boundary
is typically done like this (?<!\S)text(?!\S).
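
JavaScript engines that implement ES2018 do support lookbehind, so this shorter form can be used directly there. Below is a sketch with a hypothetical helper name, `replaceBetweenWhitespace`. It relies on JavaScript's built-in `\s`/`\S`, which is close to but not exactly the explicit code point list above, so treat the two as similar rather than identical.

```javascript
// Whitespace-boundary replace using lookbehind and lookahead (ES2018 and later).
function replaceBetweenWhitespace(str, text, repWith) {
  // Escape regex syntax characters so the text is matched literally.
  const escaped = text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
  // (?<!\S): not preceded by a non-whitespace character (start of string also qualifies).
  // (?!\S):  not followed by a non-whitespace character (end of string also qualifies).
  const rx = new RegExp('(?<!\\S)' + escaped + '(?!\\S)', 'gu');
  // Nothing is consumed around the match, so no $1 is needed in the replacement.
  return str.replace(rx, () => repWith);
}

const input = "الشمس والقمر والنجوم، ثم النجوم والنهار";
console.log(replaceBetweenWhitespace(input, 'النجوم', 'الليل'));
// → الشمس والقمر والنجوم، ثم الليل والنهار
```

Because the boundary is whitespace rather than "non-letter", a word directly followed by punctuation (for example a comma) is left untouched.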

Thank you! This helps in other aspects in how to construct my ranges for Arabic text too. — mohsenmadi, Jan 05 '18 at 20:03
