I'm trying to write the output of my script to a text file, but when the output contains right-to-left characters (Arabic, for example), those entries are displayed from right to left in the file.
This is my script:

import unicodedata
import sys
from tabulate import tabulate

headers = ["Unicode Point", "Character in UTF-8 + length", "Character normalized + length"]
data = []
f = open('multiplierNFD.txt', 'a', encoding='utf8')
for i in range (sys.maxunicode + 1):
  uni = chr(i)
  char8 = uni.encode('utf8', 'ignore').decode('utf8', 'ignore')
  char8norm = unicodedata.normalize('NFKC', char8)
  if len(char8) != len(char8norm):
    if i < 65535:
      str1 = "U+" + str(hex(i))[2:].rjust(4,'0')
    else:
      str1 = "U+" + str(hex(i))[2:].rjust(8,'0')
    str2 = char8 + " ---> " + str(len(char8))
    str3 = char8norm + " ---> " + str(len(char8norm))
    data.append([str1, str2, str3])
f.write(tabulate(data, headers=headers))
f.close()

And this is a sample of the output:

U+fb16           ﬖ ---> 1                       վն ---> 2
U+fb17           ﬗ ---> 1                       մխ ---> 2
U+fb1d           יִ ---> 1                       יִ ---> 2
U+fb1f           ײַ ---> 1                       ײַ ---> 2
U+fb2a           שׁ ---> 1                       שׁ ---> 2

How can I avoid this and print/save the output like in the first two lines?

Marco
    [This answer](https://stackoverflow.com/questions/29671593/how-to-print-arabic-characters-in-left-to-right-direction) should help. – evergreen May 13 '21 at 16:18

1 Answer

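Wrap each character in LEFT-TO-RIGHT OVERRIDE (U+202D) control characters, so that right-to-left characters are laid out left to right like the rest of the row: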

import unicodedata
import sys
from tabulate import tabulate

ltr = '\N{LEFT-TO-RIGHT OVERRIDE}'

headers=["Unicode", "Character + UTF-8 length", "NFKC + UTF-8 length"]
data = []
for i in range (sys.maxunicode + 1):
    uni = chr(i)
    nfkc = unicodedata.normalize('NFKC', uni)
    if len(uni) != len(nfkc):
        str1 = f'U+{i:04X}'
        str2 = f'{ltr}{uni}{ltr} ---> {len(uni.encode())}'
        str3 = f'{ltr}{nfkc}{ltr} ---> {len(nfkc.encode())}'
        data.append([str1, str2, str3])

with open('multiplierNFD.txt', 'w', encoding='utf8') as f:
    f.write(tabulate(data, headers=headers))

Sample of output:

Unicode    Character + UTF-8 length    NFKC + UTF-8 length
---------  --------------------------  --------------------------
...
U+FB16     ‭ﬖ‭ ---> 3                    ‭վն‭ ---> 4
U+FB17     ‭ﬗ‭ ---> 3                    ‭մխ‭ ---> 4
U+FB1D     ‭יִ‭ ---> 3                    ‭יִ‭ ---> 4
U+FB1F     ‭ײַ‭ ---> 3                    ‭ײַ‭ ---> 4
U+FB2A     ‭שׁ‭ ---> 3                    ‭שׁ‭ ---> 4
...
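
Only some rows are affected because only some characters have a right-to-left bidirectional class. As a minimal sketch (the helper names `is_rtl` and `force_ltr` are hypothetical and not part of the code above), the standard `unicodedata.bidirectional` function can report which characters are right-to-left, and the override can be closed with POP DIRECTIONAL FORMATTING (U+202C) instead of a second override, on the assumption that you want the override to end right after the wrapped text:

```python
import unicodedata

LRO = '\N{LEFT-TO-RIGHT OVERRIDE}'      # U+202D
PDF = '\N{POP DIRECTIONAL FORMATTING}'  # U+202C

def is_rtl(text: str) -> bool:
    """Return True if any character has a right-to-left bidi class (R or AL)."""
    return any(unicodedata.bidirectional(c) in ('R', 'AL') for c in text)

def force_ltr(text: str) -> str:
    """Wrap text in an override/pop pair so a bidi-aware viewer shows it left to right."""
    return f'{LRO}{text}{PDF}'

# Hypothetical usage: only wrap the cells that actually need it.
for ch in ('\ufb16', '\ufb2a'):  # Armenian ligature, Hebrew shin with shin dot
    cell = force_ltr(ch) if is_rtl(ch) else ch
    print(f'U+{ord(ch):04X}', unicodedata.bidirectional(ch), repr(cell))
```

The control characters are invisible but are real code points in the file, so anything that later parses the table has to strip them or account for them.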

I also cleaned up the code a bit and made it report the UTF-8 length in bytes, as the headers say, instead of the number of code points. Don't confuse Unicode code points with their UTF-8 encoding. For example, this line has no effect on the output:

char8 = uni.encode('utf8', 'ignore').decode('utf8', 'ignore')

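Every code point except the surrogates (U+D800 to U+DFFF) can be encoded in UTF-8, and decoding converts the bytes back to the original character, so uni == char8 in your code. The surrogates are dropped by 'ignore', but normalization leaves them unchanged, so they never pass the length comparison and never appear in the table anyway.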
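
To make the difference between the two lengths concrete, here is a small sketch using U+FB2A from the sample output. The variable names are illustrative only; the calls are the standard `str.encode` and `unicodedata.normalize`:

```python
import unicodedata

ch = '\ufb2a'  # HEBREW LETTER SHIN WITH SHIN DOT
nfkc = unicodedata.normalize('NFKC', ch)

# len() on a str counts code points.
code_point_lengths = (len(ch), len(nfkc))  # (1, 2)

# len() on the encoded bytes counts UTF-8 bytes.
utf8_lengths = (len(ch.encode('utf8')), len(nfkc.encode('utf8')))  # (3, 4)

# Encoding and then decoding returns the same string.
round_trip = ch.encode('utf8', 'ignore').decode('utf8', 'ignore')

# A lone surrogate has no UTF-8 encoding, so 'ignore' drops it.
surrogate_bytes = '\ud800'.encode('utf8', 'ignore')  # b''

print(code_point_lengths, utf8_lengths, round_trip == ch, surrogate_bytes)
```

This matches the two tables above: the question's output shows 1 and 2 (code points), while the answer's output shows 3 and 4 (bytes).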

Mark Tolonen