Why does normalization change undefined characters to 'a'?

Question

I am attempting to write a script that will normalize artist names stored in my MP3 files. The issue I am running into is that the unicodedata.normalize function translates most accented characters to 'a'. Here is the code and output:

def remove_accents(data):
return ''.join(x for x in unicodedata.normalize('NFD', data) if x in string.ascii_letters).lower()

string2 = "Mötley Crüe"

string3 = makeEnglish3.convertChars(string2)
print(string3)

Output

matleycrae

I would expect motleycrue, what am I doing wrong?

Assuming `makeEnglish3.convertChars(string2)` is supposed to say `remove_accents(string2)` (and that your indentation error isn't in your actual code), this works fine for me, and returns `motleycrue`. — Zero Piraeus, Feb 16 '17 at 16:01
What’s the output of `print(repr(string2))`? Your original string probably isn’t what you expect it to be; ö is `c3 b6` in UTF-8, and U+00C3 is “Ã”. — Ry-, Feb 16 '17 at 16:13
Ugh, figured out the issue. Had this at the top of my file. # -*- coding: latin-1 -*- — Robert McDougal, Feb 16 '17 at 16:20
You could post that as an answer. Other people might also run into this issue — trent, Feb 16 '17 at 16:23

score 2 · Answer 1 · answered Dec 30 '19 at 21:25

Take a look at what happens when this latin1 string is printed.

# -*- coding: latin-1 -*-
string2 = "Mötley Crüe"
print(string2)

It prints MÃ¶tley CrÃ¼e. Notice that the characters after Ã are not alphabetical. In the following snippet, which is essentially your remove_accents() function, it's going to strip the non-alphabetical characters out:

import string
import unicodedata

for x in unicodedata.normalize('NFD', string2):
    # keep only plain ASCII letters; everything else is dropped
    if x in string.ascii_letters:
        print(x, end='')

This outputs MAtleyCrAe. Even a string with different accented characters, like shåpÈ, still prints as shÃ¥pÃ, because the byte sequences in your Python file are being interpreted as latin1 even though the file is saved on disk as UTF-8.
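If you want to reproduce this without touching the coding declaration, here is a minimal sketch. It assumes the mismatch can be simulated by encoding the intended string as UTF-8 and decoding those bytes as latin1; the names `intended` and `misread` are only for illustration, and the string is written with escapes so the result does not depend on how this snippet itself is saved.

```python
import string
import unicodedata


def remove_accents(data):
    # Same logic as the function in the question
    return ''.join(
        x for x in unicodedata.normalize('NFD', data)
        if x in string.ascii_letters
    ).lower()


# "Mötley Crüe", written with escapes (illustrative name)
intended = "M\u00f6tley Cr\u00fce"

# Simulate a UTF-8 source file being read as latin1
misread = intended.encode('utf-8').decode('latin-1')

print(remove_accents(intended))  # the result the question expected
print(remove_accents(misread))   # the result the question actually got
```

The first call works on the real accented letters, while the second sees Ã plus a symbol in place of each one.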

Why are characters being read as 'Ã'?

Let's check the byte composition of these characters:

>>> bytes('ö', 'utf-8')
b'\xc3\xb6'
>>> bytes('ü', 'utf-8')
b'\xc3\xbc'
>>> bytes('ö', 'utf-8').decode('latin1')
'Ã¶'
>>> bytes('ü', 'utf-8').decode('latin1')
'Ã¼'

You can see that in UTF-8 each of these accented characters is encoded as the byte \xc3 followed by a second byte. In latin1, \xc3 is Ã. The second byte is decoded as a separate character, here ¶ and ¼, which are not letters and get filtered out; for other accented characters it could decode to a letter, so you might see Ã followed by a seemingly random letter. The Ã is what produces the 'a' in your output (after the accent is removed and the result is lowercased).
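To see what each byte turns into, and to check that the damage is reversible, something like the following should work. It is only a sketch: `mojibake` and `repaired` are illustrative names, and the round trip through latin1 only helps for text that was already mis-decoded this way. In your case the actual fix is the one from the comments, removing or correcting the coding declaration so it matches how the file is saved.

```python
import unicodedata

# Show which characters the UTF-8 bytes of ö and ü become under latin1
for ch in "\u00f6\u00fc":
    raw = ch.encode('utf-8')
    as_latin1 = raw.decode('latin-1')
    print(raw, [unicodedata.name(c) for c in as_latin1])

# NFD splits Ã into a plain 'A' plus a combining tilde,
# which is why an 'a' survives the ASCII-letter filter
decomposed = unicodedata.normalize('NFD', "\u00c3")
print([unicodedata.name(c) for c in decomposed])

# Undo the mis-decoding: back to the original bytes, then decode as UTF-8
mojibake = "M\u00c3\u00b6tley Cr\u00c3\u00bce"  # "MÃ¶tley CrÃ¼e"
repaired = mojibake.encode('latin-1').decode('utf-8')
print(repaired)
```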
