How to embed English in right to left languages?

Question

I want to print a Persian phrase (right-to-left) in a Python console application. This works when all characters are Persian. However, if the text is mixed with English characters (including a trailing dot "."), it is displayed in the wrong order.

Examples:

این خوب است # this is okay
این خوب است. 

# the dot must be on the left-most side, not the right-most side.
# the problem exists even in this editor.

این متن شامل English است.

The last one must be printed as:

.است English این متن شامل

To type the above, I typed it in the wrong order to show the right display order from right to left!

Python test with the same results:

>>> print("این خوب است")
این خوب است
>>> print("این خوب است.")
این خوب است.
>>> print("این متن شامل English است.")
این متن شامل English است.

Anyway, this answer seems like a solution, but it's in Java. When I tried the same approach in Python, it just printed some numbers inside the string.

Also, to compare the actual order with the required display order, check this website and copy and paste the third example into it. It gives the correct order; I just don't know how to do the same in Python.

Could you provide a [mcve]? To me, they're all printed as written (although, that might be due to my language preferences in the terminal). — Ted Klein Bergman, Dec 16 '20 at 08:10
@TedKleinBergman I don't understand what you mean. Can you type the third example in the desired order? There are many hacks and code snippets for achieving that, so it shouldn't be that easy. Anyway, try the Java applet I have now linked in the question to see what I mean by actual order and display order. — Ahmad, Dec 16 '20 at 08:14

tripleee · Answer 1 · 2020-12-16T10:27:47.980

The information from the linked question IMHO trivially applies here as well.

print("\u202bاین خوب است.\u202c")

produces what I hope should be the correct output for you:

‫این خوب است.‬

Similarly,

print("\u202bاین متن شامل \u202aEnglish\u202c است.\u202c")

prints

‫این متن شامل ‪English‬ است.‬

Just in case the rendering of this question changes something, here is a screen shot:

And here's an analysis of the text I copy/pasted out of your question, to hopefully make sure I have the right code points:

>>> print(list("U+%04X" % ord(x) for x in "این خوب است"))
['U+0627', 'U+06CC', 'U+0646', 'U+0020', 'U+062E', 'U+0648', 'U+0628', 'U+0020', 'U+0627', 'U+0633', 'U+062A']
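The same kind of inspection can be extended to list each character's bidirectional class, which is what a rendering engine uses to decide direction. This is only a sketch: `bidi_classes` is a hypothetical helper name, and it relies on the standard library's `unicodedata.bidirectional`. The Persian letters report a strong right-to-left class and the Latin letters a strong left-to-right class, while the space and the full stop report neutral or weak classes, so their direction has to come from the surrounding context.

```python
import unicodedata


def bidi_classes(text):
    """Return (code point, bidi class) pairs for each character in text.

    Hypothetical helper for illustration; it only reports the class of each
    character and does not run the bidirectional algorithm itself.
    """
    return [("U+%04X" % ord(ch), unicodedata.bidirectional(ch)) for ch in text]


# The mixed example from the question, in logical (typing) order.
sample = "این متن شامل English است."
for code_point, bidi_class in bidi_classes(sample):
    print(code_point, bidi_class)
```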

However, in fact,

print("\u202bاین متن شامل English است.\u202c")

without explicit embedding codes around the English text works fine, too:

‫این متن شامل English است.‬

I guess the trailing dot in the quoted string gets interpreted, somehow, somewhere, as part of a surrounding LTR context, and so only the actual Arabic text gets rendered RTL. Wrapping the entire string in explicit directional embedding codes makes the intended direction explicit. Generally speaking, this should not be necessary except where punctuation sits on the boundary of an embedding, i.e., where you switch from LTR to RTL or vice versa around a piece of punctuation.
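If the strings come from somewhere other than a literal, the wrapping can be done in a small function. This is a minimal sketch, not a complete bidi solution: `embed_rtl` is a hypothetical name, and it only adds U+202B (RIGHT-TO-LEFT EMBEDDING) and U+202C (POP DIRECTIONAL FORMATTING) around the whole string. It still depends on the terminal or other renderer honouring those codes.

```python
RLE = "\u202b"  # RIGHT-TO-LEFT EMBEDDING
PDF = "\u202c"  # POP DIRECTIONAL FORMATTING


def embed_rtl(text):
    """Wrap text in an explicit right-to-left embedding.

    Hypothetical helper: it does not reorder any characters, it only marks
    the whole string as an RTL run for a renderer that supports bidi codes.
    """
    return RLE + text + PDF


# Same strings as above, but wrapped programmatically.
print(embed_rtl("این خوب است."))
print(embed_rtl("این متن شامل English است."))
```

The same function could be applied to text read from a file or from `input()` before printing it.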

edited Dec 16 '20 at 10:27

answered Dec 16 '20 at 08:44

tripleee


I can't read Persian or Arabic so this is based on comparing my output with what is displayed in your question. If you can find a way to express your string as a sequence of unambiguous Unicode code points, that would perhaps make this easier to reason about and verify. – tripleee Dec 16 '20 at 08:49
Thank you very much. I'm currently testing it in the Python console on Ubuntu 20 with no success. I also tested it as a program run in a terminal, again with no success. The codes you inserted have no effect on the order, and the text is still shown in the wrong order (as if they weren't there). Maybe the terminal doesn't support directional codes. – Ahmad Dec 16 '20 at 09:18
Try copy/pasting to a window where you know bidirectional rendering is working, like maybe a browser or word processor. Perhaps it's just that the terminal doesn't know what to do with the bidi codes. – tripleee Dec 16 '20 at 09:19
Did you check the last link I mentioned? If you copy and paste the third example there, it shows the position of each character in the desired display order. Currently, I need it to work in the console application I wrote. I think if I can get the order that the applet somehow calculates, then I could print the characters in that order one by one and achieve my goal. – Ahmad Dec 16 '20 at 09:21
This is copy/paste output here: این متن شامل ‪English‬ است.‬ – Ahmad Dec 16 '20 at 09:23
I tested your solution in Jupyter notebook and it works there! – Ahmad Dec 16 '20 at 09:26
This seems by and large outdated, but there is one answer from last year: https://askubuntu.com/questions/77657/how-to-enable-arabic-support-in-gnome-terminal – tripleee Dec 16 '20 at 09:26
Moreover, supposing your inserted bidi codes work, how can I take input from a user or a file, convert it to a string containing the bidi codes, and then print it? Again, I guess my best bet is to pass it to the mentioned library to reorder it into my desired order and then print it character by character in my ncurses application, which uses addstr to print a character. – Ahmad Dec 16 '20 at 09:30
I think some replacements would be required in your solution to add bidi codes around the dot and the English characters. – Ahmad Dec 16 '20 at 09:32
That really sounds like a substantial enough additional requirement that you should probably post a new, separate question about that. Probably link back here. – tripleee Dec 16 '20 at 09:32
The problem exists in many editors, and the whole problem statement is as above. I think it's clear and needs no additional branches. The problem is just printing ANY text that contains English among characters written right to left, or more generally, printing LTR characters next to RTL ones. Those were just examples to show the problem. – Ahmad Dec 16 '20 at 09:35

I guess you probably don't need to meticulously mark up every embedding. The problem _I think_ with the final dot is that in `print("(arabic text).")` Python, or maybe the rendering engine, guesses that the dot is part of the LTR bidi context, and not the entire string which contains the RTL embedding, and so you have to mark that up explicitly. I updated the answer slightly to reflect this. – tripleee Dec 16 '20 at 09:50
Thanks, yeah, it treats the direction of spaces or the dot as LTR. I even tried the output in a Jupyter notebook without any additional bidi codes and it displays fine. So the problem also depends on the rendering engine. – Ahmad Dec 16 '20 at 10:25
