I am interested in learning how to strip diacritics from text. To that end, I am trying to better understand what's going on in the following lines of code, which I found in a 2017 post (How to replace accented characters in python?):
import unicodedata
text = unicodedata.normalize('NFD', text).encode('ascii','ignore').decode('utf-8')
Here is my rough understanding:
unicodedata.normalize('NFD',text)
decomposes each character into its canonical decomposed form (e.g. à, a single code point, becomes a followed by a combining grave accent).
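To check my understanding of this step, I inspected the intermediate result in the REPL (the variable name is just mine):

```python
import unicodedata

# 'à' starts out as one precomposed code point (U+00E0)
decomposed = unicodedata.normalize('NFD', 'à')

# After NFD it is two code points: the base letter plus a combining accent
print(len(decomposed))  # 2
print([unicodedata.name(c) for c in decomposed])
# ['LATIN SMALL LETTER A', 'COMBINING GRAVE ACCENT']
```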
.encode('ascii','ignore')
converts the normalized text into an ASCII byte string (b'string'), silently dropping any characters that cannot be encoded as ASCII, i.e. the combining accents.
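If I understand correctly, the 'ignore' error handler is what actually strips the accents, since the combining marks produced by NFD are not ASCII. A small check on a sample phrase seems to confirm this:

```python
import unicodedata

decomposed = unicodedata.normalize('NFD', 'à la carte')
# 'ignore' drops the non-ASCII combining accents instead of raising UnicodeEncodeError
ascii_bytes = decomposed.encode('ascii', 'ignore')
print(ascii_bytes)  # b'a la carte'
```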
.decode('utf-8')
returns the string decoded from the given bytes, but this is where I get stuck. Why not use .decode('ascii')
instead? Do the two encodings overlap?
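From what I can tell, ASCII is a subset of UTF-8, so for bytes produced by the encode('ascii', ...) step the two decodes should give the same result, but I'd like confirmation. My own quick test:

```python
import unicodedata

raw = unicodedata.normalize('NFD', 'déjà vu').encode('ascii', 'ignore')
# Every pure-ASCII byte sequence is also valid UTF-8, so both decodes agree here
print(raw.decode('utf-8') == raw.decode('ascii'))  # True
print(raw.decode('ascii'))  # deja vu
```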