I am interested in learning how to strip diacritics from text. To that end, I am trying to better understand what's going on in the following line of code, which I found in a 2017 post (How to replace accented characters in python?):

import unicodedata

text = unicodedata.normalize('NFD', text).encode('ascii','ignore').decode('utf-8')

Here is my rough understanding:

unicodedata.normalize('NFD', text) translates each character into its decomposed form (e.g. à becomes a followed by a combining grave accent).

.encode('ascii', 'ignore') converts the normalized text into an ASCII byte string (b'string') and ignores any encoding errors.

.decode('utf-8') returns the string decoded from the given bytes, but this is where I get stuck. Why not use .decode('ascii') instead? Do the two encodings overlap?
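
For example, stepping through the pipeline with a sample string gives something like the following (Python 3, outputs shown as comments):

import unicodedata

text = 'àéî noël'
step1 = unicodedata.normalize('NFD', text)
# 'a\u0300e\u0301i\u0302 noe\u0308l' - the accents are split off as combining characters
step2 = step1.encode('ascii', 'ignore')
# b'aei noel' - the combining characters are discarded
step3 = step2.decode('utf-8')
# 'aei noel'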

1 Answer

Your understanding is mostly correct. The trick is .encode('ascii', 'ignore'). ASCII can only express 128 characters, and a lot of those aren't even printable. The ASCII character set certainly contains no characters with diacritics. So it's trying to force the text into the ASCII character set, and ignore causes every character it can't express to be silently discarded, which gets rid of all those decomposed combining diacritics.
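
To make that concrete, here's a rough sketch of what ignore is throwing away (the code point names shown are for this particular example):

import unicodedata

decomposed = unicodedata.normalize('NFD', 'à')
[unicodedata.name(c) for c in decomposed]
# ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']
decomposed.encode('ascii', 'replace')  # b'a?' - 'replace' leaves a placeholder
decomposed.encode('ascii', 'ignore')   # b'a'  - 'ignore' silently drops the accent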

You're right that decoding it as UTF-8 doesn't inherently make a lot of sense; decoding it as ASCII would make more sense. But, like a lot of encodings, UTF-8 is a superset of ASCII. Any valid ASCII string is also a valid UTF-8 string, or a valid ISO-8859-1 string, or a valid string in a lot of other encodings. You could decode it in any of those compatible encodings and get the same result. The author explicitly choosing UTF-8 is… just slightly bizarre, but technically inconsequential.
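
You can check that quickly: any pure-ASCII byte string (which is exactly what .encode('ascii', 'ignore') produces) decodes to the same text under all of those encodings:

raw = b'aei noel'
raw.decode('ascii')       # 'aei noel'
raw.decode('utf-8')       # 'aei noel'
raw.decode('iso-8859-1')  # 'aei noel'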

deceze