I am attempting to write a script that will normalize the artist names stored in my MP3 files. The issue I am running into is that the unicodedata.normalize function turns most accented characters into 'a'. Here is the code and the output:

import string
import unicodedata

def remove_accents(data):
    return ''.join(x for x in unicodedata.normalize('NFD', data) if x in string.ascii_letters).lower()

string2 = "Mötley Crüe"

string3 = makeEnglish3.convertChars(string2)
print(string3)

Output

matleycrae

I would expect motleycrue. What am I doing wrong?

ShadowRanger
  • Assuming `makeEnglish3.convertChars(string2)` is supposed to say `remove_accents(string2)` (and that your indentation error isn't in your actual code), this works fine for me, and returns `motleycrue`. – Zero Piraeus Feb 16 '17 at 16:01
  • What’s the output of `print(repr(string2))`? Your original string probably isn’t what you expect it to be; ö is `c3 b6` in UTF-8, and U+00C3 is “Ã”. – Ry- Feb 16 '17 at 16:13
  • Ugh, figured out the issue. Had this at the top of my file: `# -*- coding: latin-1 -*-` – Robert McDougal Feb 16 '17 at 16:20
  • You could post that as an answer. Other people might also run into this issue. – trent Feb 16 '17 at 16:23

1 Answer

Take a look at what happens when this string is printed from a file that is saved as UTF-8 but declares latin1:

# -*- coding: latin-1 -*-
string2 = "Mötley Crüe"
print(string2)

It prints MÃ¶tley CrÃ¼e. Notice that the characters after each Ã (¶ and ¼) are not letters. The following snippet, which is essentially your remove_accents() function, strips out those non-alphabetic characters:

import string
import unicodedata

for x in unicodedata.normalize('NFD', string2):
    if x in string.ascii_letters:
        print(x, end='')

This outputs MAtleyCrAe: NFD splits each Ã into A plus a combining tilde, and the filter keeps only the A. Even a string with other accented characters, such as shåpÈ, still prints as shÃ¥pÃ, because the byte sequences in your Python file are being interpreted as latin1 even though the file is saved on disk as UTF-8.
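The effect can be reproduced without touching the file's coding declaration. The following is a minimal sketch, assuming Python 3; the variable names `original` and `misread` are only illustrative, and the accented letters are written as `\u` escapes so the result does not depend on how the script itself is saved.

```python
import string
import unicodedata


def remove_accents(data):
    # Same logic as the function in the question.
    return ''.join(
        x for x in unicodedata.normalize('NFD', data)
        if x in string.ascii_letters
    ).lower()


# "Mötley Crüe", spelled with escapes so the source encoding is irrelevant.
original = "M\u00f6tley Cr\u00fce"

# Simulate the bug: the UTF-8 bytes of the string are decoded as latin1.
misread = original.encode('utf-8').decode('latin-1')

print(remove_accents(original))  # motleycrue
print(remove_accents(misread))   # matleycrae
```

The second call shows the output from the question, which suggests the function itself is fine and the input string was already wrong before it was normalized.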

Why are the characters being read as 'Ã'?

Let's check the byte composition of these characters:

>>> bytes('ö', 'utf-8')
b'\xc3\xb6'
>>> bytes('ü', 'utf-8')
b'\xc3\xbc'
>>> bytes('ö', 'utf-8').decode('latin1')
'Ã¶'
>>> bytes('ü', 'utf-8').decode('latin1')
'Ã¼'

You can see that each of these accented characters is encoded as \xc3 followed by a second byte. In latin1, \xc3 is Ã, and the second byte decodes to a separate latin1 character (here ¶ and ¼) that is not an ASCII letter, so it gets filtered out. The Ã is what becomes 'a' in your output once its accent is removed and it is lowercased.
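The fix reported in the comments was to remove the `# -*- coding: latin-1 -*-` line so that the declared encoding matches how the file is actually saved (UTF-8 is the default in Python 3). If you instead end up with strings that were already mis-decoded, reversing the two steps usually recovers them. This is only a sketch: `repair_mojibake` is a hypothetical helper name, not a library function, and it assumes the text was UTF-8 wrongly decoded as latin1.

```python
def repair_mojibake(text):
    """Undo a UTF-8-read-as-latin1 mix-up, returning text unchanged if it does not apply."""
    try:
        # Turn the characters back into the original bytes, then decode them correctly.
        return text.encode('latin-1').decode('utf-8')
    except (UnicodeEncodeError, UnicodeDecodeError):
        # Either the text has characters outside latin1, or its bytes are not
        # valid UTF-8, so it was probably not mis-decoded in this way.
        return text


broken = "M\u00c3\u00b6tley Cr\u00c3\u00bce"  # what the latin1 declaration produced
print(repair_mojibake(broken))
```

Strings that are already correct, such as a properly decoded "Mötley Crüe", fail the UTF-8 decode step and are returned as they are.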

Zenul_Abidin