This question is related to this one. But since the solution I tried does not work, I am opening a new question to deal with my specific problems.
Context:
In the application I am developing, I need to build Python regexes that include Unicode characters, possibly from the whole range(0, 0x110000). When I build my regex, for example with the following:
regex += mycodepoint_as_char + ".{0," + str(max_repeat) + "}"
I observe that for some code points, the order is reversed, as if I had written:
regex += "{0," + str(max_repeat) + "}." + mycodepoint_as_char
regex = ή.{0,2}{0,3}.䝆⚭.{0,3}俩.{0,4}ⷭ
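One way to check whether the order is really reversed inside the string, or only in how the terminal or editor renders it, is to list the code points in index order. This is a minimal sketch, not code from my application: the helper name `logical_order` and the choice of U+05D0 as a sample right-to-left character are assumptions made for illustration.

```python
def logical_order(s):
    """Return the code points of s in storage (index) order, as hex strings."""
    return [hex(ord(c)) for c in s]

max_repeat = 3
rtl_char = "\u05d0"  # HEBREW LETTER ALEF, used here only as a sample RTL character
regex = rtl_char + ".{0," + str(max_repeat) + "}"

# If the first entry is 0x5d0, the string itself is in the order it was built,
# whatever the display shows.
print(logical_order(regex))
```

If the list starts with the code point that was concatenated first, the reversal would be a rendering effect rather than a change in the string.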
In other cases, I get an exception.
So I studied the Unicode standard for bidirectional text and some Q/A that explain surrogate pairs, the special Left-To-Right and Right-To-Left code points, and some prohibited code points reserved for UTF-16.
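For reference, the standard `unicodedata` module exposes these properties directly. The sketch below is only an illustration under my own assumptions: the helper names `is_rtl` and `is_surrogate` are hypothetical, and treating only the strong classes `R` and `AL` as "RTL" is a simplification (explicit embedding and override controls have their own classes).

```python
import unicodedata

def bidi_class(cp):
    """Return the bidirectional class of a code point ('' if none is defined)."""
    return unicodedata.bidirectional(chr(cp))

def is_rtl(cp):
    # 'R' (right-to-left) and 'AL' (Arabic letter) are the strong RTL classes.
    return bidi_class(cp) in ("R", "AL")

def is_surrogate(cp):
    # General category 'Cs' is the surrogate block U+D800..U+DFFF.
    return unicodedata.category(chr(cp)) == "Cs"

print(is_rtl(0x05D0), is_rtl(ord("a")), is_surrogate(0xD800))
```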
My problem:
So I decided to test all code points and to build a list of the RTL ones and of the prohibited ones, assuming that the former would change the order in the string and that the latter would raise an exception.
Here is my test code:
#!/usr/bin/python3

import sys
import os
import unicodedata #https://docs.python.org/fr/3/library/unicodedata.html, https://fr.wikipedia.org/wiki/Normalisation_Unicode

def group_consecutive(l):
    res = []
    i1 = 0
    i2 = 0
    while i1 < len(l):
        while i2 + 1 < len(l) and l[i2+1] == l[i2] + 1:
            i2 += 1
        res.append((i1, i2+1)) # range(i1, i2+1) has consecutive values
        i1 = i2+1
        i2 = i1
    return res

def id_rtl_code_points():
    oldstdout = sys.stdout # https://stackoverflow.com/questions/8777152/unable-to-restore-stdout-to-original-only-to-terminal
    nullstdout = open(os.devnull, 'w') # https://stackoverflow.com/questions/26837247/how-to-disable-print-statements-conveniently-so-that-pythonw-can-run?noredirect=1&lq=1
    forbiddenCP = []
    sep = 'a' # choose a letter that can receive modifiers
    s = ""
    for i in range(0, 0x110000):
        if i%0x10000 == 0:
            print(hex(i) + "-------------") # show progress
        try:
            if len(s) % 2 == 1: # keep synchronised, sep on modulo = 0, chr(i) on modulo = 1
                s += sep
            #sys.stdout = nullstdout
            print(hex(i), " : " + sep + chr(i)) # without print, no error
        except:
            forbiddenCP.append(i)
        else:
            s += sep + chr(i)
        finally:
            pass
            #sys.stdout = oldstdout
    s += sep
    rtlCP = []
    for i in range(0, 0x110000,2):
        if s[i] != sep: # not sure at all this algorithm is right
            rtlCP.append(ord(s[i]))
    sys.stdout = oldstdout
    #print("id_rtl_code_points - s = ", s)
    print("rtlCP = ", group_consecutive(rtlCP))
    print("rtlCP% = ", round(float(len(rtlCP))/0x110000*100, 2), "%")
    print("forbiddenCP = ", group_consecutive(forbiddenCP))
    print("forbiddenCP% = ", round(float(len(forbiddenCP))/0x110000*100, 2), "%")

def main():
    id_rtl_code_points()

if __name__ == '__main__':
    main()
Run as is, I get the following (rows of dots mark skipped output):
$ ./test.py
0x0-------------
0x0 : a
0x1 : a
0x2 : a
....................
0x21 : a!
0x22 : a"
0x23 : a#
0x24 : a$
....................
0x60 : a`
0x61 : aa
0x62 : ab
0x63 : ac
0x64 : ad
....................
0x98 : a
0x9a : a
0x9b : a
9c : a
0x9d : a$ 1;1;120;120;1;0x
Not so good, I don't understand why it stops displaying.
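One thing that may be worth ruling out: U+0080 to U+009F are C1 control characters, and a terminal may interpret some of them as control sequences instead of displaying them. A minimal sketch of printing a visible placeholder for non-printable code points follows. The helper `safe_display` and the `<U+XXXX>` placeholder format are my own assumptions, and using it would change what the print test actually exercises.

```python
def safe_display(cp):
    """Return chr(cp) if it is printable, else an ASCII placeholder like <U+009B>."""
    ch = chr(cp)
    # str.isprintable() is False for control characters, surrogates, etc.
    return ch if ch.isprintable() else f"<U+{cp:04X}>"

for cp in (0x61, 0x9B, 0x9D, 0xD800):
    print(hex(cp), ":", safe_display(cp))
```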
If I redirect stdout to /dev/null for the exception test (by uncommenting the two sys.stdout assignments inside the loop), I get:
$ ./test.py
0x0-------------
0x10000-------------
0x20000-------------
0x30000-------------
0x40000-------------
0x50000-------------
0x60000-------------
0x70000-------------
0x80000-------------
0x90000-------------
0xa0000-------------
0xb0000-------------
0xc0000-------------
0xd0000-------------
0xe0000-------------
0xf0000-------------
0x100000-------------
rtlCP = []
rtlCP% = 0.0 %
forbiddenCP = [(0, 2048)]
forbiddenCP% = 0.18 %
Would the first 2048 code points raise an exception? Of course not, so this result makes no sense. I would have expected problems in the range U+D800 to U+DFFF.
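For comparison, here is a minimal check, independent of the script above, of which code points cannot be encoded at all. It assumes UTF-8, which I believe is the encoding used for the file opened on os.devnull on my system; the function name `unencodable_code_points` is hypothetical.

```python
def unencodable_code_points(encoding="utf-8"):
    """Return every code point whose character cannot be encoded with `encoding`."""
    bad = []
    for cp in range(0x110000):
        try:
            chr(cp).encode(encoding)
        except UnicodeEncodeError:
            bad.append(cp)
    return bad

bad = unencodable_code_points()
print(len(bad), hex(bad[0]), hex(bad[-1]))  # 2048 0xd800 0xdfff
```

The count is 2048, which is also the length of the (0, 2048) range reported above, so that pair might be worth reading as positions in forbiddenCP rather than as code point values.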
Is my approach correct, and if so, what am I missing? Or is it nonsense, and if so, why?