Using unicode / umlauts in Python: Dictionary v manual input

Question

I am using a dictionary to store some character pairs in Python (I am replacing umlaut characters). Here is what it looks like:

umlautdict={
    'ae': 'ä',
    'ue': 'ü',
    'oe': 'ö'
    }

Then I run my inputwords through it like so:

for item in umlautdict.keys():
        outputword=inputword.replace(item,umlautdict[item])

But this does not do anything (no replacement happens). When I printed out my umlautdict, I saw that it looks like this:

{'ue': '\xfc', 'oe': '\xf6', 'ae': '\xc3\xa4'}

Of course that is not what I want; however, trying things like unicode() (which raised an error) or prefixing the strings with u did not improve things.

If I type the 'ä' or 'ö' into the replace() command by hand, everything works just fine. I also added # -*- coding: utf-8 -*- to the top of my script (working in TextWrangler), as it would not even let me execute the script containing umlauts without it.

So I don't get...

Why does this happen? Why and when do the umlauts change from "good to evil" when I store them in the dictionary?
How do I fix it?
Also, if anyone knows: what is a good resource to learn about encoding in Python? I have issues all the time and so many things don't make sense to me / I can't wrap my head around.

I'm working on a Mac in Python 2.7.10. Thanks for your help!

Pretty sure it **does** work and you're just messing it up with that non-logical use of `inputword` and `outputword`. — Stefan Pochmann, Jun 28 '16 at 18:11
Take a look at `str.translate` method. It's more proper for such tasks. — Mazdak, Jun 28 '16 at 18:13
For a brief overview of Unicode in Python 2.x see: http://stackoverflow.com/questions/21129020/how-to-fix-unicodedecodeerror-ascii-codec-cant-decode-byte/35444608#35444608 — Alastair McCormack, Jun 30 '16 at 15:38

score 3 · Answer 1 · answered Jun 29 '16 at 01:05

Converting to Unicode is done by decoding your string (assuming you're getting bytes):

data = "haer ueber loess"
word = data.decode('utf-8')  # actual encoding depends on your data

Define your dict with unicode strings as well:

umlautdict={
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö'
    }

Finally, print umlautdict will print out a representation of that dict, usually involving escape sequences. That's normal; you don't have to worry about it.
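As a rough sketch of how the pieces above fit together (decode the input, use unicode strings in the dict, encode again only on output), something like the following should work. The names `raw_bytes`, `result` and `encoded` are made up for illustration, and UTF-8 input is an assumption; substitute whatever encoding your data actually uses. The `u''` and `b''` prefixes are written out so the snippet behaves the same on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
# Sketch only: raw_bytes, result and encoded are illustrative names.

# Unicode keys and values, as in the answer above.
umlautdict = {
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö',
}

# Assumption: the input arrives as UTF-8 encoded bytes.
raw_bytes = b"haer ueber loess"
word = raw_bytes.decode('utf-8')  # bytes -> unicode text

# Reassign the same variable so each replacement builds on the previous one.
result = word
for ascii_pair, umlaut in umlautdict.items():
    result = result.replace(ascii_pair, umlaut)

# Encode only at the boundary, e.g. just before writing to a file.
encoded = result.encode('utf-8')

# The escapes seen when printing the dict are just other spellings:
assert u'ä' == u'\xe4'                      # one code point, escaped form
assert u'ä'.encode('utf-8') == b'\xc3\xa4'  # the same character as two UTF-8 bytes
```

The two assertions at the end are the reason the escapes in the printed dict are harmless: `u'\xe4'` is the same string as `u'ä'`, and `'\xc3\xa4'` is that character stored as two UTF-8 bytes.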

score 2 · Accepted Answer · answered Jun 28 '16 at 18:15

Declare your source file's encoding.
Use raw format for the special characters.
Iterate properly on your string: keep the changes from each loop iteration as you head to the next.

Here's code to get the job done:

# -*- coding: utf-8 -*-

umlautdict = {
    'ae': r'ä',
    'ue': r'ü',
    'oe': r'ö'
    }

print umlautdict

inputword = "haer ueber loess"
for item in umlautdict.keys():
    inputword = inputword.replace(item, umlautdict[item])

print inputword

Output:

{'ue': '\xc3\xbc', 'oe': '\xc3\xb6', 'ae': '\xc3\xa4'}
här über löss
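
To make the third point concrete, here is a small comparison of the loop from the question with the corrected loop. This is only a sketch: `buggy_replace` and `fixed_replace` are hypothetical helper names, and unicode literals are used so that it runs unchanged on Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-

umlautdict = {
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö',
}


def buggy_replace(inputword):
    """Loop as written in the question: every pass starts again from the
    untouched inputword, so only the last replacement survives."""
    outputword = inputword
    for item in umlautdict:
        outputword = inputword.replace(item, umlautdict[item])
    return outputword


def fixed_replace(inputword):
    """Carry the result forward so the replacements accumulate."""
    outputword = inputword
    for item in umlautdict:
        outputword = outputword.replace(item, umlautdict[item])
    return outputword


word = u"haer ueber loess"
buggy = buggy_replace(word)  # at most one of the three pairs is replaced
fixed = fixed_replace(word)  # all three pairs are replaced
```

With the buggy loop, if the key visited last does not occur in the word at all, the result is identical to the input, which looks exactly like "no replacement happens".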
