LTR and RTL String Split Irrespective of the Language Directionality

Question

Long story short, I have text like the following:

const string example1 = "good בוקר טוב morning";

I am trying to split the above text into 4 words (tokens, in my case) for processing. No matter which approach I take, I still get the 2 Hebrew words in the wrong order. By order I mean the display (visual) order, not the logical order in which the characters are stored.

When using String.Split for example1

Returned

good
בוקר
טוב
morning

In the above you may notice that the 2 Hebrew words come back in logical order, which is the reverse of where they appear on screen.

Expected

good
טוב
בוקר
morning

I have tried using Regex.Split, forcing invariant cultures, and even appending Unicode directional characters so that the split follows the LTR display order.

I am not keen on solving this via Regex. I am looking for a generic approach to this problem that would also hold for other RTL languages, say Arabic.
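To make the expected behaviour concrete, here is a rough sketch of one possible direction, not a full implementation of the Unicode Bidirectional Algorithm. It splits on whitespace (an assumption, since the real separator may be culture specific), treats a token as RTL if it contains a character from a few hard-coded Hebrew/Arabic-area ranges (also an assumption, since .NET exposes no public bidi-class lookup that I know of), and reverses each consecutive run of RTL tokens. It further assumes the overall paragraph direction is LTR. The names VisualOrderSplitter, SplitInVisualOrder, IsRtlChar and IsRtlToken are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

public static class VisualOrderSplitter
{
    // Rough RTL test (assumption): Hebrew through Arabic Extended-A, plus
    // the Hebrew/Arabic presentation forms. This is not the full bidi class
    // table; Arabic-Indic digits, for example, are lumped in as RTL here.
    static bool IsRtlChar(char c) =>
        (c >= '\u0590' && c <= '\u08FF') ||
        (c >= '\uFB1D' && c <= '\uFDFF') ||
        (c >= '\uFE70' && c <= '\uFEFC');

    // A token counts as RTL if any of its characters passes the test above.
    static bool IsRtlToken(string token) => token.Any(IsRtlChar);

    // Splits on whitespace and returns the tokens in left-to-right display
    // order, assuming an LTR paragraph. The characters inside each token
    // are left untouched; only the token positions change.
    public static List<string> SplitInVisualOrder(string text)
    {
        string[] tokens = text.Split((char[])null, StringSplitOptions.RemoveEmptyEntries);
        var result = new List<string>(tokens.Length);

        int i = 0;
        while (i < tokens.Length)
        {
            if (!IsRtlToken(tokens[i]))
            {
                // LTR tokens keep their logical position.
                result.Add(tokens[i]);
                i++;
                continue;
            }

            // Find the end of this run of consecutive RTL tokens...
            int start = i;
            while (i < tokens.Length && IsRtlToken(tokens[i]))
            {
                i++;
            }

            // ...and emit the run back to front, which is how it is displayed.
            for (int j = i - 1; j >= start; j--)
            {
                result.Add(tokens[j]);
            }
        }

        return result;
    }
}
```

Calling VisualOrderSplitter.SplitInVisualOrder(example1) yields good, טוב, בוקר, morning, which is the order listed under Expected. No invisible markers are added, so later string equality comparisons on the tokens are unaffected. Digits, punctuation and nested embeddings inside an RTL run are not handled by this sketch.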

Before making this post I checked the following questions, in case this one gets marked as a duplicate of them.

Parsing through Arabic / RTL text from left to right - Presumes that the developer knows where the RTL word is located. I am not using a semi-colon as a word separator. It can be any culture specific character. Also, adding an invisible marker would result in future string equality comparison failure.

c# split and revers sentence with two languages - The character order is modified. I wish to keep them intact.

Any advice or suggestion that puts me in the right direction is appreciated.

Is your example there the order you get them in? Or the order you want them to be in? — Matt Burland, Jun 07 '17 at 13:15
@MattBurland Sorry I didn't add that in there. It's the order I am getting it in. I shall update the post to avoid further confusions. — Nathan, Jun 07 '17 at 13:15
@MattBurland Marking this as a duplicate within seconds shows you didn't actually read the post. It is not a duplicate of what you've suggested. In that post the resolution provided presumes that the developer knows where the RTL words are located and has to explicitly add a Left-to-Right Mark (U+200E) for each RTL word. — Nathan, Jun 07 '17 at 13:43

0 Answers