
I am working on pattern matching for URL strings that contain category names in Arabic.

For example, in English, whenever I see something like the following:

matching pattern -> (.*)/Store/SomeThing/(.*)

I replace it with this pattern -> $1/store/something

so that this

http://baseurl.com/en-gb/Store/SomeThing/WhatEver

drops the trailing "WhatEver" and becomes

http://baseurl.com/en-gb/store/something
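
For reference, the English rule above can be sketched as follows. Python's `re` module is an assumption here, since no engine is named in the question; note that Python writes the `$1` back-reference as `\1`. The function name `rewrite` is made up for this illustration.

```python
import re

# The matching pattern and the replacement from the English example.
PATTERN = re.compile(r"(.*)/Store/SomeThing/(.*)")
REPLACEMENT = r"\1/store/something"  # "$1" in the question is "\1" in Python


def rewrite(url: str) -> str:
    """Apply the rule; URLs that do not match are returned unchanged."""
    return PATTERN.sub(REPLACEMENT, url)


print(rewrite("http://baseurl.com/en-gb/Store/SomeThing/WhatEver"))
# -> http://baseurl.com/en-gb/store/something
```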

Now, how can I do the same thing when the category is written in Arabic?

For example, here are my tests:

1) Test urls to match:

  • 1a) http://baseurl.com/ar-gb/Store/عرمنتجات/عرع

  • 1b) http://baseurl.com/ar-gb/Store/عرع/عرمنتجات

How can I cut everything that comes after عرمنتجات, given that "/" seems to be handled just like the surrounding Arabic letters, as if it were an Arabic character itself?

2) Matching patterns under test:

  • 2a) (.*)/Store/عرمنتجات/(.*)

  • 2b) (.*)/Store/(.*)/عرمنتجات

  • 2c) (.*)/Store/عرمنتجات

:::: TEST RESULTS ::::

During my tests:

  • (1a) matched (2a) and (2c), both of which look very strange to me

  • (1b) matched (2b), which is also strange: I would have expected (2a) to match it, but it doesn't

Long story short: what is the equivalent of the pattern (.*)/Store/SomeThing/(.*) when SomeThing is written in Arabic?

Cris
  • You need to say what language / regex engine you are using. – Tomalak May 28 '18 at 16:30
  • Cutting everything after `عرمنتجات` should be as easy as doing it with Latin-based scripts. `Regex.Replace(s, @"عرمنتجات.*", "")` or (if the text must remain) `Regex.Replace(s, @"(?<=عرمنتجات).*", "")` – Wiktor Stribiżew May 28 '18 at 18:31
  • From the description of the regex tag: "Since regular expressions are not fully standardized, all questions with this tag should also include a tag specifying the applicable programming language or tool." The results you're getting may be different when using the same regex in another engine. – NickL May 28 '18 at 20:17
  • Possible duplicate of [Unicode characters in Regex](https://stackoverflow.com/questions/20641297/unicode-characters-in-regex) – Andrea Corbellini May 29 '18 at 14:45
  • This is not a question about Unicode characters in regex but about right-to-left languages. It’s about the ordering, not the characters. Thanks – Cris May 29 '18 at 14:54

1 Answer


This behavior might seem odd, but punctuation such as "/" (I am not sure what the full list of such characters would be) that is embedded between Arabic letters is also rendered right-to-left. Look at the UTF-8 bytes (in hex) of your examples:

/Store/عرمنتجات/عرع
 2F53746F72652F   D8B9D8B1D985D986D8AAD8ACD8A7D8AA   2F   D8B9D8B1D8B9
|--------------| |--------------------------------| |--| |------------|
  "/Store/"                   عرمنتجات               /  i    عرع

/Store/عرع/عرمنتجات
 2F53746F72652F   D8B9D8B1D8B9   2F   D8B9D8B1D985D986D8AAD8ACD8A7D8AA
|--------------| |------------| |--| |--------------------------------|
  "/Store/"           عرع        /  i              عرمنتجات

/Store/عرمنتجات/whatever
2F53746F72652F D8B9D8B1D985D986D8AAD8ACD8A7D8AA  2F  7768617465766572
|------------| |------------------------------| |--| |--------------|
  "/Store/"                عرمنتجات              /        whatever

(Note: the i is there only to prevent the rendering effect that I am trying to explain here.)
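
The byte dumps above can be reproduced with a few lines of code. This is only a sketch, assuming Python; the names `PRODUCTS`, `OTHER`, `utf8_hex` and `bidi_class` are made up for the illustration. The Arabic words are written as `\u` escapes so that the editor's right-to-left rendering cannot change how the source code looks.

```python
import unicodedata

# The two Arabic words from the question, in stored (logical) order.
PRODUCTS = "\u0639\u0631\u0645\u0646\u062a\u062c\u0627\u062a"  # عرمنتجات
OTHER = "\u0639\u0631\u0639"  # عرع


def utf8_hex(text: str) -> str:
    """Return the UTF-8 encoding of text as upper-case hex digits."""
    return text.encode("utf-8").hex().upper()


def bidi_class(char: str) -> str:
    """Return the Unicode bidirectional class of a single character."""
    return unicodedata.bidirectional(char)


path_1a = "/Store/" + PRODUCTS + "/" + OTHER
print(utf8_hex(path_1a))  # same byte order as the first dump above

# "S" is strongly left-to-right and the Arabic letter is strongly
# right-to-left, while "/" has no strong direction of its own.
for char in ("S", PRODUCTS[0], "/"):
    print(repr(char), bidi_class(char))
```

Whatever the renderer does with the "/", the stored order of the bytes is the order in which the text was typed.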

This also explains your test results: what you see depends on whether the / has a Latin letter adjacent to it or not.
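
To check this against the test results, here is a minimal sketch that runs the three patterns against the two URLs in their stored order. It rests on assumptions: Python's `re` stands in for the unnamed engine, an unanchored search is used, and the helper names `matching` and `cut_after` are hypothetical.

```python
import re

PRODUCTS = "\u0639\u0631\u0645\u0646\u062a\u062c\u0627\u062a"  # عرمنتجات
OTHER = "\u0639\u0631\u0639"  # عرع
BASE = "http://baseurl.com/ar-gb/Store/"

# Test URLs and patterns from the question, built in stored order.
urls = {
    "1a": BASE + PRODUCTS + "/" + OTHER,
    "1b": BASE + OTHER + "/" + PRODUCTS,
}
patterns = {
    "2a": "(.*)/Store/" + PRODUCTS + "/(.*)",
    "2b": "(.*)/Store/(.*)/" + PRODUCTS,
    "2c": "(.*)/Store/" + PRODUCTS,
}


def matching(url: str) -> list:
    """Names of the patterns that match somewhere in url."""
    return [name for name, pat in patterns.items() if re.search(pat, url)]


def cut_after(url: str, segment: str) -> str:
    """Drop everything that follows the path component `segment`."""
    return re.sub("(/" + re.escape(segment) + ")(?:/.*)?$", r"\1", url)


print(matching(urls["1a"]))  # ['2a', '2c']
print(matching(urls["1b"]))  # ['2b']
print(cut_after(urls["1a"], PRODUCTS) == BASE + PRODUCTS)  # True
```

Under these assumptions the matches are the same as the ones reported in the question, so the patterns behave consistently on the stored text; only the displayed order of the Arabic segments is misleading.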

Snow bunting