Normally I use unicodedata to normalize other Latin-ish text. However, I've come across this case and am not sure what to do:

>>> import unicodedata
>>> s = 'Nguyễn Văn Trỗi'
>>> unicodedata.normalize('NFD', s)
'Nguyễn Văn Trỗi'

Is there another module that can normalize more accents than unicodedata? The output I want is:

Nguyen Van Troi

samuelbrody1249

1 Answer

normalize doesn't mean "remove accents". It converts between composed and decomposed forms:

>>> import unicodedata as ud
>>> a = 'ă'
>>> print(ascii(ud.normalize('NFD',a)))  # LATIN SMALL LETTER A + COMBINING BREVE
'a\u0306'
>>> print(ascii(ud.normalize('NFC',a)))  # LATIN SMALL LETTER A WITH BREVE
'\u0103'
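
To see what a decomposed string actually contains, it can help to list each code point by name. This is a minimal sketch using only the standard library; the helper name describe is made up for illustration and is not part of unicodedata.

```python
import unicodedata as ud

def describe(text):
    """Return the Unicode name of each code point in the NFD form of text."""
    decomposed = ud.normalize('NFD', text)
    # Fall back to a U+XXXX label for code points that have no name.
    return [ud.name(ch, f'U+{ord(ch):04X}') for ch in decomposed]

# 'ễ' is a single code point in composed form, but three after NFD.
for name in describe('ễ'):
    print(name)
```

For 'ễ' this should list a base letter followed by two combining marks, which is why the Vietnamese text in the question looks unchanged after NFD even though the underlying code points differ.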

One way to remove them is to encode the decomposed form as ASCII while ignoring errors. This works because combining characters are not ASCII, so they are dropped. Note, however, that not every accented-looking letter has a decomposed form. For example, đ (LATIN SMALL LETTER D WITH STROKE) has none, so it is dropped entirely:

>>> s = 'Nguyễn Văn Trỗi'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Nguyen Van Troi'

>>> s = 'Ngô Đình Diệm'
>>> ud.normalize('NFD',s).encode('ascii',errors='ignore').decode('ascii')
'Ngo inh Diem'  # wrong: the Đ was dropped
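
If you want to find out ahead of time which letters will be lost this way, unicodedata.decomposition returns an empty string for characters that have no decomposition mapping. A rough sketch; has_decomposition is a hypothetical helper name.

```python
import unicodedata as ud

def has_decomposition(ch):
    """True if ch has a decomposition mapping in the Unicode database."""
    # decomposition() returns '' when the character cannot be decomposed.
    return ud.decomposition(ch) != ''

# 'ô' splits into 'o' plus a combining mark; 'Đ' and 'đ' do not split.
for ch in 'ôĐđ':
    print(ch, has_decomposition(ch), repr(ud.decomposition(ch)))
```

Any non-ASCII letter for which this returns False is a candidate for an explicit mapping.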

You can work around the exceptions with a translation table:

>>> table = {ord('Đ'):'D',ord('đ'):'d'}
>>> ud.normalize('NFD',s).translate(table).encode('ascii',errors='ignore').decode('ascii')
'Ngo Dinh Diem'
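
The steps above could be wrapped into one reusable function. The sketch below is a variation rather than the exact code shown earlier: instead of encoding to ASCII, it drops only characters that unicodedata.combining reports as combining marks, so other non-ASCII text is kept rather than silently deleted. The names EXTRA and strip_accents are hypothetical, and the table covers only the two letters discussed here.

```python
import unicodedata as ud

# Letters with no Unicode decomposition need an explicit mapping.
# Assumption: only Đ/đ are handled; extend this table for your own data.
EXTRA = str.maketrans({'Đ': 'D', 'đ': 'd'})

def strip_accents(text, table=EXTRA):
    """Decompose, map the exceptions, then drop combining marks."""
    decomposed = ud.normalize('NFD', text).translate(table)
    # combining() returns 0 for base characters, nonzero for combining marks.
    return ''.join(ch for ch in decomposed if not ud.combining(ch))

print(strip_accents('Ngô Đình Diệm'))
print(strip_accents('Nguyễn Văn Trỗi'))
```

If you do need strictly ASCII output, you can still append .encode('ascii', errors='ignore').decode('ascii') to the result, at the cost of losing any remaining non-ASCII characters.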

Mark Tolonen