Take a look at what happens when this latin1 string is printed.
# -*- coding: latin-1 -*-
string2 = "Mötley Crüe"
print(string2)
It prints MÃ¶tley CrÃ¼e. Notice that the characters after Ã are not alphabetical. The following snippet, which is essentially your remove_accents() function, strips the non-alphabetical characters out:
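import string
import unicodedata
# NFD splits each accented character into a base letter plus combining marks
for x in unicodedata.normalize('NFD', string2):
    # keep only plain ASCII letters; marks, symbols and spaces are dropped
    if x in string.ascii_letters:
        print(x, end='')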
It outputs MAtleyCrAe. Even a string with different accented characters, like shåpÈ, still prints as shÃ¥pÃ, because the byte sequences in your Python file are being interpreted as latin1, even though the file is saved on the filesystem as UTF-8.
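The same effect can be reproduced without relying on the coding declaration, by encoding the text as UTF-8 and then decoding those bytes as latin1. This is only a minimal sketch: the names original, misread and stripped are illustrative, and escape sequences are used so the result does not depend on how the file itself is saved.

```python
import string
import unicodedata

# "Mötley Crüe", written with escapes so the file's own encoding is irrelevant.
original = "M\u00f6tley Cr\u00fce"

# Simulate UTF-8 bytes being interpreted as latin1.
misread = original.encode("utf-8").decode("latin-1")
print(misread)

# Keep only ASCII letters after NFD decomposition, as in the loop above.
stripped = "".join(
    ch for ch in unicodedata.normalize("NFD", misread)
    if ch in string.ascii_letters
)
print(stripped)
```

Each accented letter turns into two characters in misread, and only the decomposed base letter of the first one survives the filter.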
Why are characters being read as 'Ã'?
Let's check the byte composition of these characters:
>>> bytes('ö', 'utf-8')
b'\xc3\xb6'
>>> bytes('ü', 'utf-8')
b'\xc3\xbc'
>>> bytes('ö', 'utf-8').decode('latin1')
'Ã¶'
>>> bytes('ü', 'utf-8').decode('latin1')
'Ã¼'
You can see that these accented characters are encoded as \xc3 followed by a second byte. In latin1, \xc3 is Ã. Since the second byte can map to a visible character too, you might see Ã followed by a seemingly random character. This is what produces the 'a' in your output: once the accent is removed from Ã and the result is lowercased, an 'a' is left.
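To check this for more than two characters, the sketch below loops over the accented letters used in this answer and records their UTF-8 bytes next to the latin1 reading of those bytes. It also shows how NFD decomposes Ã, which is where the stray letter comes from. The variable names are illustrative only.

```python
import unicodedata

rows = []
for ch in "\u00f6\u00fc\u00e5\u00c8":    # ö, ü, å, È
    raw = ch.encode("utf-8")             # two bytes, the first one is 0xC3
    seen = raw.decode("latin-1")         # what a latin1 reader makes of them
    rows.append((ch, raw, seen))
    print(repr(ch), raw, repr(seen))

# NFD splits "Ã" (U+00C3) into a plain "A" plus a combining tilde (U+0303).
decomposed = unicodedata.normalize("NFD", "\u00c3")
print([hex(ord(c)) for c in decomposed])
```

Every row starts with the byte 0xC3, so every misread character starts with Ã, and stripping the tilde from Ã leaves the plain A that is later lowercased.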