2

I want to perform string replacements on Urdu words, but the following code does not replace آپ with aap. I am using word boundaries so that it replaces whole words and not parts of words.

var str ="آپ کا نام کیا ہے؟";
var res = str.replace(/\bآپ\b/g, "aap");
console.log(res);

I expect the following output:

 کا نام کیا ہے؟ aap
Muhammad Naufil
  • *"I expect the following output"* Because of the RTL vs. LTR aspect, the replace will put `aap` on the left, e.g.: `aap کا نام کیا ہے؟`. – T.J. Crowder Jun 29 '19 at 10:18

3 Answers

1

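Try it without \b, matching on whitespace (or the start/end of the string) instead:

var str = "آپ کا نام کیا ہے؟";
// (^|\s) consumes the whitespace before the word, so "$1" puts it back
var res = str.replace(/(^|\s)آپ(?=\s|$)/g, "$1aap");
console.log(res);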
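
One caveat with this pattern, sketched below: the leading group `(^|\s)` is part of the match, so any whitespace it captures is replaced along with the word unless the replacement refers back to it with `$1`. The sample sentence is made up for illustration and is not from the question.

```javascript
// Illustrative sentence (an assumption, not from the question): the word is mid-sentence.
const sentence = "کیا آپ ٹھیک ہیں؟";
const pattern = /(^|\s)آپ(?=\s|$)/g;

// Without "$1" the space captured before the word is lost.
const dropped = sentence.replace(pattern, "aap");
// With "$1" the captured whitespace is written back before the replacement.
const kept = sentence.replace(pattern, "$1aap");

console.log(dropped.includes(" aap")); // false
console.log(kept.includes(" aap "));   // true
```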
Ghoul Ahmed
  • I am using word boundaries so that it replaces whole words and not parts of words. We can't remove the word boundaries – Muhammad Naufil Jun 29 '19 at 09:17
  • @MuhammadNaufil, (^|\s)آپ(?=\s|$) does the same job as the \b boundaries; try it now – Ghoul Ahmed Jun 29 '19 at 09:30
  • @MuhammadNaufil, boundaries \b doesn't work for unicode characters [Word boundaries + unicode characters](https://stackoverflow.com/questions/10590098/javascript-regexp-word-boundaries-unicode-characters) – Ghoul Ahmed Jun 29 '19 at 09:34
  • @GhoulAhmed - Yes they do, but only for English-centric ones (although not very well :-) ). (Remember, **all** characters are "Unicode characters".) – T.J. Crowder Jun 29 '19 at 10:02
1

\b is English-centric, I'm afraid, and not actually that good at even being English-centric. :-) (For instance, it would match at the end of "English" in "English-centric".)
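
A minimal sketch of why `\b` fails on the question's input: `\b` marks a transition between a `\w` character and a non-`\w` character, and `\w` covers only `[A-Za-z0-9_]`. The helper name `hasWordBoundaryMatch` is hypothetical and exists only for this illustration.

```javascript
// Hypothetical helper: tests \bword\b against text.
// `word` is assumed to contain no regex metacharacters.
function hasWordBoundaryMatch(text, word) {
  return new RegExp("\\b" + word + "\\b").test(text);
}

const urduWord = "آپ";
const urduText = "آپ کا نام کیا ہے؟";

// Arabic-script letters are not \w characters...
console.log(/\w/.test(urduWord)); // false
// ...so there is no \w / non-\w transition around the word, and \b never matches.
console.log(hasWordBoundaryMatch(urduText, urduWord)); // false
// The same construct works for ASCII words.
console.log(hasWordBoundaryMatch("what is your name?", "your")); // true
```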

You can use lookarounds with a negated Unicode "letter" category to check for word boundaries. Those features exist in the most recent JavaScript spec, but support is very spotty. You can throw a library at it, though: XRegExp by Steven Levithan:

var str ="آپ کا نام کیا ہے؟";
var rex = XRegExp("(?<=^|[^\\p{Letter}])آپ(?=$|[^\\p{Letter}])", "g");
var res = str.replace(rex, "aap");
console.log(res);
<script src="https://cdnjs.cloudflare.com/ajax/libs/xregexp/3.2.0/xregexp-all.min.js"></script>

In that regular expression:

  • (?<=^|[^\p{Letter}]) is a look-behind for the start of input or a non-letter per the Unicode standard. (Note that the \ has to be escaped inside the string we pass to XRegExp so that XRegExp receives it, since \ is an escape character in string literals.)
  • آپ is the word
  • (?=$|[^\p{Letter}]) is a look-ahead for the end of input or a non-letter. (Again, with the \ escaped in the string.)

As I mentioned in my comment, because of the right-to-left (RTL) vs. left-to-right (LTR) script difference (e.g., Arabic script vs. Latin script), the result shows up as aap کا نام کیا ہے؟ rather than your expected output. The text was replaced in the right place: the Urdu word is at the beginning of the string, but when rendered, all of the Arabic script is output right-to-left. So in the updated string, the Latin script (aap) is output left-to-right, followed by the Arabic script right-to-left.
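
One way to confirm that the replacement landed at the logical start of the string, regardless of how it is rendered, is to inspect indexes rather than the display. This sketch uses a plain whitespace-based pattern only because it runs on any engine; that choice is an assumption made for the illustration.

```javascript
const original = "آپ کا نام کیا ہے؟";
// Whitespace-based whole-word match; "$1" restores any captured whitespace.
const replaced = original.replace(/(^|\s)آپ(?=\s|$)/g, "$1aap");

// In memory (logical order) the Latin text comes first,
// even though a bidi-aware display may draw it on the left of RTL text.
console.log(replaced.indexOf("aap"));     // 0
console.log(replaced.startsWith("aap ")); // true
```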

In a really up-to-date JavaScript engine, you could do it natively:

var str ="آپ کا نام کیا ہے؟";
// The u flag is required; without it \p{Letter} is not treated as a Unicode property escape
var rex = /(?<=^|[^\p{Letter}])آپ(?=$|[^\p{Letter}])/gu;
var res = str.replace(rex, "aap");
console.log(res);

That works in the version of V8 in Chrome v75 and Node.js v12.4, for instance.
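
If you need this for more than one word, the native approach can be wrapped in a small function. This is only a sketch: `escapeRegExp` and `replaceWholeWord` are hypothetical names, and it assumes an engine with look-behind and Unicode property escape support, as described above.

```javascript
// Hypothetical helper: escape regex metacharacters in a literal word.
function escapeRegExp(literal) {
  return literal.replace(/[.*+?^${}()|[\]\\]/g, "\\$&");
}

// Hypothetical helper: replace `word` only where it is not adjacent to a letter.
function replaceWholeWord(text, word, replacement) {
  const rex = new RegExp(
    "(?<=^|[^\\p{Letter}])" + escapeRegExp(word) + "(?=$|[^\\p{Letter}])",
    "gu" // g: every occurrence; u: enables \p{...}
  );
  // A function replacement keeps any "$" in `replacement` literal.
  return text.replace(rex, () => replacement);
}

const question = "آپ کا نام کیا ہے؟";
console.log(replaceWholeWord(question, "آپ", "aap").startsWith("aap ")); // true

// The word followed directly by another letter is left alone.
const joined = "آپکا نام";
console.log(replaceWholeWord(joined, "آپ", "aap") === joined); // true
```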

(Side note: With XRegExp, you could use the shorthand \pL instead of \p{Letter}, but not with JavaScript's own regular expressions.)

T.J. Crowder
0

I'm not sure whether this expression,

(?<=^|\s)(آپ)(?=\s|$)

is exactly what we want here, but it may be an option. Note that it needs an engine that supports look-behind.

In this demo, the expression is explained.

Test

const regex = /(?<=^|\s)(آپ)(?=\s|$)/gm;
const str = `آپ
آپ کا نام کیا ہے؟
آپ کا نام کیا ہے؟ آپ کا نام کیا ہے؟
آپکاآپکا نام کیا ہے؟آپکا نام کیا ہے؟`;
const subst = `aap`;

console.log(str.replace(regex, subst));
Emma