How does .NET's regex engine treat RTL+LTR mixed strings?

Question

I have a mixed Hebrew/English string to parse. The string is built like this:

[3 hebrew] [2 english 2] [1 hebrew],

So it is read as 1 2 3 but stored as 3 2 1 (that is the exact byte sequence in the file, double-checked in a hex editor; RTL is only a display attribute anyway). The .NET regex parser has an RTL option (RegexOptions.RightToLeft) which, when given for plain LTR text, starts processing from the right side of the string.

My question: should this option be applied to extract the [3 hebrew] and [2 english] parts from the string, or to check whether [1 hebrew] matches the end of the string? Are there any hidden specifics, or is there nothing to worry about (as when processing any LTR string with special Unicode characters)?

Also, can anyone recommend a good RTL+LTR text editor? I'm afraid that VS Express sometimes displays the text wrongly and might even start corrupting the saved strings, and I would like to re-check the files without using hex editors any more.

If you stored the string as 1 2 3 you could split the two strings, read them using RTL, then read the third string using the default option. The only way I know to enable RTL support is to enable it within Windows. — Security Hound, Oct 20 '11 at 14:31
The script direction has nothing to do with this though. Regex's RightToLeft is a misnomer based on assumptions about left-to-right scripts, as I explain in my answer. — Jon Hanna, Oct 20 '11 at 14:46

score 3 · Accepted Answer · answered Oct 20 '11 at 14:39

The RightToLeft option refers to the order in which the regular expression moves through the character sequence, and should really be called LastToFirst: in the case of Hebrew and Arabic that order is actually left-to-right on screen, and with mixed RTL and LTR text such as you describe, the expression "right to left" is even less appropriate.

This has a minor effect on speed (it will only matter if the searched text is massive) and it changes the behavior of regular expressions run with a startAt index: the search covers the part of the string before startAt rather than the part after it.

Examples; let's hope the browsers don't mess this up too much:

string saying = "למכות is in כתר"; //Just because it amuses me that this is a saying whatever way round the browser puts malkuth and kether.
string kether = "כתר";
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying));//True
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying));//True, perhaps minutely faster but so little that noise would hide it.
Console.WriteLine(new Regex(kether, RegexOptions.RightToLeft).IsMatch(saying, 2));//False
Console.WriteLine(new Regex(kether, RegexOptions.None).IsMatch(saying, 2));//True
//And to show that the ordering is codepoint rather than physical display ordering:
Console.WriteLine(new Regex("" + kether[0] + ".*" + kether[2]).IsMatch(saying));//True
Console.WriteLine(new Regex("" + kether[2] + ".*" + kether[0]).IsMatch(saying));//False
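
To tie this back to the question, here is a minimal sketch of pulling the Hebrew and English runs out of a mixed string in stored order. The sample text, the class name MixedTextDemo and its helper methods are all hypothetical and chosen only for illustration; the Hebrew is written with \u escapes so that bidi rendering in an editor or browser cannot reorder it. It assumes the parts are separated by spaces and that the Hebrew parts contain only characters from the Hebrew Unicode block.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper class, for illustration only.
public static class MixedTextDemo
{
    // Assumed sample in stored (logical) order: Hebrew, English, Hebrew.
    // Indices: Hebrew 0-3, space 4, "hello" 5-9, space 10, Hebrew 11-13.
    public const string Sample = "\u05E9\u05DC\u05D5\u05DD hello \u05DB\u05EA\u05E8";

    // One or more consecutive characters from the Hebrew Unicode block.
    private const string HebrewRun = @"\p{IsHebrew}+";

    // Default scan: walks from index 0 upward, so it finds the first stored run.
    public static Match FirstHebrewRun(string s)
    {
        return Regex.Match(s, HebrewRun);
    }

    // RightToLeft scan: walks from the last index downward, so it finds the last stored run.
    public static Match LastHebrewRun(string s)
    {
        return Regex.Match(s, HebrewRun, RegexOptions.RightToLeft);
    }

    // The English part is just a run of ASCII letters; direction plays no role.
    public static Match EnglishRun(string s)
    {
        return Regex.Match(s, "[A-Za-z]+");
    }

    // Checking the end of the stored string needs only an end anchor, not RightToLeft.
    public static bool EndsWithHebrew(string s)
    {
        return Regex.IsMatch(s, HebrewRun + @"\z");
    }

    public static void Main()
    {
        Console.WriteLine(FirstHebrewRun(Sample).Index); // 0
        Console.WriteLine(LastHebrewRun(Sample).Index);  // 11
        Console.WriteLine(EnglishRun(Sample).Value);     // hello
        Console.WriteLine(EndsWithHebrew(Sample));       // True
    }
}
```

In this sketch RightToLeft is only a convenience for getting at the last run directly; the same result could be obtained with the default option by enumerating all matches and taking the final one.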

`Reverse` is a better name.. but wait, why should this be an option and not a function.. oh, http://stackoverflow.com/questions/228038/best-way-to-reverse-a-string-in-c-sharp-2-0 - they don't even have a reverse() in .NET. — kagali-san, Oct 20 '11 at 21:22
