
This question is related to this one. Since the solution I tried does not work, I am opening a new question to deal with my specific problems.

Context:

In the application I am developing, I need to build Python regexes that include Unicode characters, possibly from the whole range(0, 0x110000). When I build my regex, for example with the following:

regex += mycodepoint_as_char + ".{0," + str(max_repeat) + "}"

I observe that for some code points, the order is reversed, as if I had written:

regex += "{0," + str(max_repeat) + "}." + mycodepoint_as_char
regex =  ή.{0,2}{0,3}.䝆⚭.{0,3}俩.{0,4}ⷭ

In other cases, I get an exception.
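To separate display effects from the actual string content, a quick check along these lines might help. It is only a sketch: `rtl_char`, `suffix` and `compiles` are names made up for the illustration, and the idea that the exceptions come from regex metacharacters such as `(` is an assumption, not something I have confirmed. `ascii()` escapes every non-ASCII character, so the terminal cannot reorder what it prints.

```python
import re

# Hypothetical value for illustration: U+05D0 (Hebrew alef) is a right-to-left letter.
rtl_char = chr(0x05D0)
suffix = ".{0,3}"
regex = rtl_char + suffix

# ascii() shows the logical (in-memory) order, independent of bidi rendering.
logical = ascii(regex)
print(logical)

# The pattern matches in logical order: the RTL letter first, then up to 3 characters.
matched = re.fullmatch(regex, rtl_char + "ab") is not None
print(matched)


def compiles(pattern):
    """Return True if the pattern compiles, False if re raises an error."""
    try:
        re.compile(pattern)
        return True
    except re.error:
        return False


# Assumption: a code point that is a regex metacharacter, such as "(", breaks the pattern.
raw_ok = compiles("(" + suffix)
escaped_ok = compiles(re.escape("(") + suffix)
print(raw_ok, escaped_ok)
```

If the escaped form shows the expected order, the reversal is only in how the terminal or editor draws the string, not in the regex itself.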

So I studied the standard for bidirectional Unicode and some Q/A that explain surrogate pairs, Left-To-Right and Right-To-Left special code points, and some prohibited ones reserved for UTF-16.
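As a minimal sketch of what "reserved for UTF-16" means in Python: `chr()` accepts a surrogate code point, but the resulting character cannot be encoded to UTF-8. The helper name `is_encodable` is made up for this illustration.

```python
def is_encodable(cp: int) -> bool:
    """Return True if chr(cp) can be encoded as UTF-8."""
    try:
        chr(cp).encode("utf-8")
        return True
    except UnicodeEncodeError:
        return False


# Collect every code point whose character cannot be encoded.
not_encodable = [cp for cp in range(0x110000) if not is_encodable(cp)]

# Report the first and last offending code points and how many there are.
print(hex(not_encodable[0]), hex(not_encodable[-1]), len(not_encodable))
```

No terminal output of the characters themselves is involved here, so control codes cannot disturb the display.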

My problem:

Then I decided to test all of them and to build a list of the RTL ones and the prohibited ones, assuming the former would change the order in the string and the latter would raise an exception.

Here is my test code:

#!/usr/bin/python3

import sys
import os
import unicodedata #https://docs.python.org/fr/3/library/unicodedata.html, https://fr.wikipedia.org/wiki/Normalisation_Unicode


def group_consecutive(l):
    res = []
    i1 = 0
    i2 = 0
    while i1 < len(l):
        while i2 + 1 < len(l) and l[i2+1] == l[i2] + 1:
            i2 += 1
        res.append((i1, i2+1)) # range(i1, i2+1) has consecutive values
        i1 = i2+1
        i2 = i1
    return res
        
    
def id_rtl_code_points():
    oldstdout = sys.stdout # https://stackoverflow.com/questions/8777152/unable-to-restore-stdout-to-original-only-to-terminal
    nullstdout = open(os.devnull, 'w') # https://stackoverflow.com/questions/26837247/how-to-disable-print-statements-conveniently-so-that-pythonw-can-run?noredirect=1&lq=1
    forbiddenCP = []
    sep = 'a' # choose a letter that can receive modifiers
    s = ""
    for i in range(0, 0x110000):
        if i%0x10000 == 0:
            print(hex(i) + "-------------") # show progress
        try:
            if len(s) % 2 == 1: #keep synchronised, sep on modulo = 0, chr(i) on modulo = 1
                s += sep
            #sys.stdout = nullstdout
            print(hex(i), " : " + sep + chr(i)) # without print, no error
        except:
            forbiddenCP.append(i)
        else:
            s += sep + chr(i)
        finally:
            pass
            #sys.stdout = oldstdout
    s += sep
    rtlCP = []
    for i in range(0, 0x110000,2):
        if s[i] != sep: # not sure at all that this algorithm is right
            rtlCP.append(ord(s[i]))
    sys.stdout = oldstdout
    #print("id_rtl_code_points - s = ", s)
    print("rtlCP = ", group_consecutive(rtlCP))
    print("rtlCP% = ", round(float(len(rtlCP))/0x110000*100, 2), "%")
    print("forbiddenCP = ", group_consecutive(forbiddenCP))
    print("forbiddenCP% = ", round(float(len(forbiddenCP))/0x110000*100, 2), "%")

def main():
    id_rtl_code_points()

if __name__ == '__main__':
    main()
    

Run as is, I get (skipped parts are shown as dots):

$ ./test.py 
0x0-------------
0x0  : a
0x1  : a
0x2  : a
....................
0x21  : a!
0x22  : a"
0x23  : a#
0x24  : a$
....................
0x60  : a`
0x61  : aa
0x62  : ab
0x63  : ac
0x64  : ad
....................
0x98  : a
0x9a  : a
         0x9b  : a
9c  : a
0x9d  : a$ 1;1;120;120;1;0x

Not so good: I don't understand why it stops displaying.

If I redirect stdout to /dev/null for the exception test (uncomment lines 33 and 41), I get:

$ ./test.py 
0x0-------------
0x10000-------------
0x20000-------------
0x30000-------------
0x40000-------------
0x50000-------------
0x60000-------------
0x70000-------------
0x80000-------------
0x90000-------------
0xa0000-------------
0xb0000-------------
0xc0000-------------
0xd0000-------------
0xe0000-------------
0xf0000-------------
0x100000-------------
rtlCP =  []
rtlCP% =  0.0 %
forbiddenCP =  [(0, 2048)]
forbiddenCP% =  0.18 %

The first 2048 code points would raise an exception? That is a silly result; of course they do not. I would have expected problems in the range U+D800 to U+DFFF.
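One thing worth checking: `group_consecutive` appends `(i1, i2+1)`, which are positions in the list, not the code point values stored there. So `(0, 2048)` may only mean that the list holds one run of 2048 consecutive values, which happens to be the size of U+D800 to U+DFFF. A hedged variant that reports the values themselves could look like this (`group_consecutive_values` is a hypothetical name, and the input is assumed to be sorted):

```python
def group_consecutive_values(values):
    """Group a sorted list of ints into (first, last) pairs of consecutive runs."""
    runs = []
    start = 0
    while start < len(values):
        end = start
        # Extend the run while the next value follows the current one.
        while end + 1 < len(values) and values[end + 1] == values[end] + 1:
            end += 1
        runs.append((values[start], values[end]))  # values, not indices
        start = end + 1
    return runs


# Illustration with the surrogate range and one isolated extra value.
sample = list(range(0xD800, 0xE000)) + [0x10FFFF]
runs = group_consecutive_values(sample)
print([(hex(first), hex(last)) for first, last in runs])
```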

Is my approach correct, and if so, what am I missing? Or is it nonsense, and if so, why?

lalebarde
    You're printing every possible Unicode code point to the terminal, including control codes. That's naturally going to cause weird display issues. – user2357112 Feb 01 '22 at 23:11
    Check out [unicodedata.bidirectional()](https://docs.python.org/3/library/unicodedata.html#unicodedata.bidirectional) to directly read the bidirectional property of a character. – Mark Tolonen Feb 02 '22 at 00:56
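Following that suggestion, a minimal sketch of reading the property directly instead of inspecting how a string displays. `unicodedata.bidirectional()` is the real standard-library call; `RTL_CLASSES` and `is_rtl` are names chosen for this illustration, and treating only the classes "R" and "AL" as right-to-left is an assumption (explicit formatting characters carry other classes such as "RLE" or "RLO").

```python
import unicodedata

# Assumption: only the strong right-to-left bidi classes are of interest.
RTL_CLASSES = {"R", "AL"}


def is_rtl(cp: int) -> bool:
    """Return True if the code point has a strong right-to-left bidi class."""
    return unicodedata.bidirectional(chr(cp)) in RTL_CLASSES


# Scan the whole code space without printing any character to the terminal.
rtl_code_points = [cp for cp in range(0x110000) if is_rtl(cp)]
print(len(rtl_code_points))
print(is_rtl(0x05D0), is_rtl(0x0627), is_rtl(ord("a")))
```

The count depends on the Unicode database version shipped with the interpreter, so no fixed number is given here.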
