Normalize foreign text

Question

Normally I use unicodedata to normalize other Latin-ish text. However, I've come across this and am not sure what to do:

>>> import unicodedata
>>> s = 'Nguyễn Văn Trỗi'
>>> unicodedata.normalize('NFD', s)
'Nguyễn Văn Trỗi'

Is there another module that can normalize more accents than unicodedata? The output I want is:

Nguyen Van Troi

Mark Tolonen · Accepted Answer · 2020-03-26T02:23:35.890

normalize doesn't mean "remove accents". It converts between composed and decomposed forms:

>>> import unicodedata as ud
>>> a = 'ă'
>>> print(ascii(ud.normalize('NFD',a)))  # LATIN SMALL LETTER A + COMBINING BREVE
'a\u0306'
>>> print(ascii(ud.normalize('NFC',a)))  # LATIN SMALL LETTER A WITH BREVE
'\u0103'
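
To see why the question's string looks unchanged after NFD, it can help to list the code points that a single accented letter decomposes into. This is a minimal sketch; the helper name `describe` is made up for illustration and is not part of unicodedata:

```python
import unicodedata as ud

def describe(text):
    """Return (character, Unicode name) pairs for the NFD form of text."""
    # 'UNKNOWN' is a fallback for code points without a name.
    return [(ch, ud.name(ch, 'UNKNOWN')) for ch in ud.normalize('NFD', text)]

# One visible letter becomes a base letter plus two combining marks.
for ch, name in describe('ễ'):
    print(ascii(ch), name)
# 'e' LATIN SMALL LETTER E
# '\u0302' COMBINING CIRCUMFLEX ACCENT
# '\u0303' COMBINING TILDE
```

The decomposed string renders the same as the composed one, so the accents are still there; they have only been split off into separate code points.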

One way to remove them is to encode the decomposed form as ASCII while ignoring errors; this works because combining characters are not ASCII. Note, however, that not all international characters have decomposed forms; đ, for example, does not.

>>> s = 'Nguyễn Văn Trỗi'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Nguyen Van Troi'

>>> s = 'Ngô Đình Diệm'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Ngo inh Diem'  # wrong: Đ has no decomposition, so it is dropped entirely

You can work around the exceptions with a translation table:

>>> table = {ord('Đ'):'D',ord('đ'):'d'}
>>> ud.normalize('NFD',s).translate(table).encode('ascii',errors='ignore').decode('ascii')
'Ngo Dinh Diem'
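
The steps above could be wrapped in a small helper. The sketch below is a variation rather than the exact method shown: instead of encoding to ASCII, it drops only characters that `unicodedata.combining` reports as combining marks, so any other non-ASCII character is kept rather than silently discarded. The names `EXTRA` and `strip_accents` are hypothetical, and the table covers only the two letters mentioned here:

```python
import unicodedata as ud

# Hand-written mappings for letters that have no canonical decomposition.
# Assumption: only Đ/đ are listed; extend the table for other such letters.
EXTRA = str.maketrans({'Đ': 'D', 'đ': 'd'})

def strip_accents(text):
    """Decompose text, apply the manual table, and drop combining marks."""
    decomposed = ud.normalize('NFD', text).translate(EXTRA)
    # combining() returns 0 for base characters and nonzero for the
    # combining marks produced by these decompositions.
    return ''.join(ch for ch in decomposed if not ud.combining(ch))

print(strip_accents('Nguyễn Văn Trỗi'))  # Nguyen Van Troi
print(strip_accents('Ngô Đình Diệm'))    # Ngo Dinh Diem
```

With this version, a letter missing from the table shows up unchanged in the output instead of vanishing, which makes gaps in the table easier to spot.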

How many exceptions are there? – TomSawyer, Jan 19 '21 at 09:50
