With the information @Richard supplied in his answer, I came up with this short Python 3 script that implements UTS#39:
"""Implement the simple algorithm laid out in UTS#39, section 4."""
import csv
import re
import unicodedata

comment_pattern = re.compile(r'\s*#.*$')


def skip_comments(lines):
    """
    A filter that strips comments and yields the remaining
    non-empty lines.

    :param lines: any object we can iterate through, such as a file
        object, list, tuple, or generator
    """
    for line in lines:
        line = comment_pattern.sub('', line).strip()
        if line:
            yield line


def normalize(s):
    return unicodedata.normalize("NFD", s)


def to_unicode(code_point):
    # Convert a hex code point such as "0391" into the character it names.
    return chr(int(code_point.strip(), 16))


def read_table(file_name):
    """Map each source character to its prototype string."""
    d = {}
    # confusables.txt is UTF-8 encoded.
    with open(file_name, encoding="utf-8") as f:
        reader = csv.reader(skip_comments(f), delimiter=";")
        for row in reader:
            source = to_unicode(row[0])
            prototypes = map(to_unicode, row[1].strip().split())
            d[source] = ''.join(prototypes)
    return d


TABLE = read_table("confusables.txt")


def skeleton(s):
    # NFD, replace each character by its prototype, then NFD again.
    s = normalize(s)
    s = ''.join(TABLE.get(c, c) for c in s)
    return normalize(s)


def confusable(s1, s2):
    return skeleton(s1) == skeleton(s2)


if __name__ == "__main__":
    for strings in [("Foo", "Bar"), ("Αяe", "Are"), ("j", "j")]:
        print(*strings)
        print("Equal:", strings[0] == strings[1])
        print("Confusable:", confusable(*strings), "\n")
The script assumes that the file confusables.txt is in the directory it is run from. In addition, I had to delete the first character of that file, because it was a weird, non-printable symbol (presumably the UTF-8 byte order mark, U+FEFF).
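If you would rather not edit the file by hand, here is a minimal sketch of a reader that tolerates a leading byte order mark. It assumes the non-printable first character really is a UTF-8 BOM, and it relies on Python's standard `utf-8-sig` codec, which drops a BOM when one is present and otherwise behaves like `utf-8`. The names `strip_comments` and `COMMENT` are just illustrative.

```python
import csv
import re

COMMENT = re.compile(r"\s*#.*$")


def strip_comments(lines):
    # Drop trailing comments and skip lines that end up empty.
    for line in lines:
        line = COMMENT.sub("", line).strip()
        if line:
            yield line


def read_table(file_name):
    table = {}
    # "utf-8-sig" decodes UTF-8 and silently removes a leading BOM,
    # so the file can be used exactly as downloaded.
    with open(file_name, encoding="utf-8-sig") as f:
        for row in csv.reader(strip_comments(f), delimiter=";"):
            source = chr(int(row[0].strip(), 16))
            table[source] = "".join(chr(int(cp, 16)) for cp in row[1].split())
    return table
```

Without this, the BOM survives `strip()` on the first comment line and ends up being parsed as if it were a code point.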
It only follows the simple algorithm laid out at the beginning of section 4, not the more complicated cases of whole-script and mixed-script confusables laid out in 4.1 and 4.2. Those are left as an exercise for the reader.
Note that "я" and "R" are not considered confusable by the Unicode Consortium, so confusable("Αяe", "Are") returns False.
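To try the skeleton comparison without downloading confusables.txt, here is a small self-contained sketch. `SAMPLE_TABLE` is a hand-picked excerpt I am assuming for illustration (three mappings that appear in the real table); it is not a substitute for the full file.

```python
import unicodedata

# Hand-picked excerpt for illustration only; the real data is confusables.txt.
SAMPLE_TABLE = {
    "\u0391": "A",  # GREEK CAPITAL LETTER ALPHA -> LATIN CAPITAL LETTER A
    "\u0430": "a",  # CYRILLIC SMALL LETTER A    -> LATIN SMALL LETTER A
    "\u0435": "e",  # CYRILLIC SMALL LETTER IE   -> LATIN SMALL LETTER E
}


def skeleton(s, table=SAMPLE_TABLE):
    # NFD, map each character to its prototype, then NFD again.
    s = unicodedata.normalize("NFD", s)
    s = "".join(table.get(c, c) for c in s)
    return unicodedata.normalize("NFD", s)


def confusable(s1, s2, table=SAMPLE_TABLE):
    return skeleton(s1, table) == skeleton(s2, table)


print(confusable("\u0391re", "Are"))       # True: Greek alpha maps to Latin A
print(confusable("\u0391\u044fe", "Are"))  # False: Cyrillic ya has no mapping to R
```

The second call shows the case from the note above: since "я" has no prototype that equals "R", the two skeletons differ.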