Regex pattern matching in right-to-left languages

Question

I am dealing with pattern matching of URL strings that contain categories written in Arabic.

For example, in English, whenever I see something like the following:

matching pattern -> (.*)/Store/SomeThing/(.*)

I replace it with this pattern -> $1/store/something

so that this

http://baseurl.com/en-gb/Store/SomeThing/WhatEver

loses the "WhatEver" part and becomes

http://baseurl.com/en-gb/store/something
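
The question does not name a regex engine, so as a rough illustration only, here is how the Latin-script replacement above could look in Python. The function name `strip_after_something` is made up for this sketch, and Python writes the backreference as `\1` where the pattern above uses `$1`.

```python
import re

def strip_after_something(url):
    # Same pattern and replacement as in the question; \1 is Python's
    # spelling of the $1 backreference.
    return re.sub(r"(.*)/Store/SomeThing/(.*)", r"\1/store/something", url)

print(strip_after_something("http://baseurl.com/en-gb/Store/SomeThing/WhatEver"))
# prints http://baseurl.com/en-gb/store/something
```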

Now, how can I do something like this in Arabic?

For example, here are my tests:

1) Test URLs to match:

1a) http://baseurl.com/ar-gb/Store/عرمنتجات/عرع
1b) http://baseurl.com/ar-gb/Store/عرع/عرمنتجات

How can I cut everything that comes after عرمنتجات, given that "/" apparently counts as an Arabic character too and is handled just like the other Arabic letters?

2) Matching patterns under test:

2a) (.*)/Store/عرمنتجات/(.*)
2b) (.*)/Store/(.*)/عرمنتجات
2c) (.*)/Store/عرمنتجات

:::: TEST RESULTS ::::

During my tests:

(1a) matched (2a) and (2c), and both results look very strange to me
(1b) matched (2b), which is also strange: I would have expected (2a) to match it, but it doesn't

Long story short, what is the equivalent of the pattern (.*)/Store/SomeThing/(.*) when SomeThing is written in Arabic?

Cutting everything after `عرمنتجات` should be as easy as it is with Latin-based scripts: `Regex.Replace(s, @"عرمنتجات.*", "")` or (if the text must remain) `Regex.Replace(s, @"(?<=عرمنتجات).*", "")` — Wiktor Stribiżew, May 28 '18 at 18:31
From the description of the regex tag: "Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool." The results you're getting may be different when using the same regex in another engine. — NickL, May 28 '18 at 20:17
Possible duplicate of [Unicode characters in Regex](https://stackoverflow.com/questions/20641297/unicode-characters-in-regex) — Andrea Corbellini, May 29 '18 at 14:45
This is not a question about Unicode characters in regex but about right-to-left languages. It’s about the ordering, not the characters. Thanks — Cris, May 29 '18 at 14:54

score 0 · Answer 1 · answered Oct 04 '18 at 07:17

This behavior might seem odd, but characters that have no strong direction of their own, such as "/" (I am not sure what the full list would be), are also rendered right-to-left when they are embedded in Arabic letters. Only the display changes; the stored order stays as typed. Look at the UTF-8 bytes of these examples:

/Store/عرمنتجات/عرع
 2F53746F72652F   D8B9D8B1D985D986D8AAD8ACD8A7D8AA   2F   D8B9D8B1D8B9
|--------------| |--------------------------------| |--| |------------|
  "/Store/"                   عرمنتجات               /  i    عرع

/Store/عرع/عرمنتجات
 2F53746F72652F   D8B9D8B1D8B9   2F   D8B9D8B1D985D986D8AAD8ACD8A7D8AA
|--------------| |------------| |--| |--------------------------------|
  "/Store/"           عرع        /  i              عرمنتجات

/Store/عرمنتجات/whatever
2F53746F72652F D8B9D8B1D985D986D8AAD8ACD8A7D8AA  2F  7768617465766572
|------------| |------------------------------| |--| |--------------|
  "/Store/"                عرمنتجات              /        whatever

(Note: the "i" is only there to prevent the rendering effect I am trying to explain here.)
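
The byte layout can be reproduced with a short script. This is only a sketch in Python (the question does not say which language is in use); the variable and function names are invented for the illustration. It also prints the Unicode bidirectional class of a Latin letter, the slash, and an Arabic letter, which is what the renderer consults when it decides the display order.

```python
import unicodedata

# The two Arabic path segments, kept in separate variables so the stored
# (logical) order of the path does not depend on how an editor displays it.
PRODUCTS = "عرمنتجات"
OTHER = "عرع"
path = f"/Store/{PRODUCTS}/{OTHER}"

def hex_bytes(text):
    # UTF-8 bytes of the text, in storage order, as upper-case hex.
    return text.encode("utf-8").hex().upper()

def bidi_classes(text):
    # Unicode bidirectional class of each character.
    return [unicodedata.bidirectional(ch) for ch in text]

for piece in ("/Store/", PRODUCTS, "/", OTHER):
    print(hex_bytes(piece))

# L = strong left-to-right, CS = common separator (weak), AL = Arabic letter
print(bidi_classes("S/" + OTHER[0]))
```

The slash comes out as a weak separator rather than a letter of either script, so its position on screen depends on its neighbours while its position in the string does not.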

This also explains your test results. In particular, it depends on whether the / has a Latin letter adjacent to it or not.
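
To check the test results against the stored order instead of the displayed order, the URLs and patterns can be assembled from named pieces. The following is a minimal sketch in Python, assuming the URLs are stored in the order shown in the byte dumps; the helper names and the lower-case `/store/` in the replacement (mirroring the English example in the question) are choices made for this illustration.

```python
import re

# Arabic segments as separate variables: the f-strings below then fix the
# stored order regardless of how the source file is rendered.
PRODUCTS = "عرمنتجات"
OTHER = "عرع"

url_1a = f"http://baseurl.com/ar-gb/Store/{PRODUCTS}/{OTHER}"
url_1b = f"http://baseurl.com/ar-gb/Store/{OTHER}/{PRODUCTS}"

pattern_2a = re.compile(f"(.*)/Store/{PRODUCTS}/(.*)")
pattern_2b = re.compile(f"(.*)/Store/(.*)/{PRODUCTS}")
pattern_2c = re.compile(f"(.*)/Store/{PRODUCTS}")

def matches(pattern, url):
    # Unanchored search, as in a typical rewrite-rule test.
    return pattern.search(url) is not None

def cut_after_products(url):
    # Drop everything after the PRODUCTS segment.
    return re.sub(f"(.*)/Store/{PRODUCTS}/(.*)", rf"\1/store/{PRODUCTS}", url)

for name, url in (("1a", url_1a), ("1b", url_1b)):
    hits = [label for label, p in (("2a", pattern_2a), ("2b", pattern_2b), ("2c", pattern_2c))
            if matches(p, url)]
    print(name, "matches", hits)
```

Under that assumption, (1a) matches (2a) and (2c) and (1b) matches only (2b), which agrees with the results reported in the question: the engine compares the strings in stored order, and only the on-screen order is rearranged.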
