-2

I am using a dictionary to store some character pairs in Python (I am replacing umlaut characters). Here is what it looks like:

umlautdict={
    'ae': 'ä',
    'ue': 'ü',
    'oe': 'ö'
    }

Then I run my inputwords through it like so:

for item in umlautdict.keys():
        outputword=inputword.replace(item,umlautdict[item])

But this does not do anything (no replacement happens). When I printed out my umlautdict, I saw that it looks like this:

{'ue': '\xfc', 'oe': '\xf6', 'ae': '\xc3\xa4'}

Of course that is not what I want; however, trying things like unicode() (--> Error) or pre-fixing u did not improve things.

If I type the 'ä' or 'ö' into the replace() command by hand, everything works just fine. I also added # -*- coding: utf-8 -*- to the top of my script (I'm working in TextWrangler), as it would not even let me execute a script containing umlauts without it.

So I don't get...

  • Why does this happen? Why and when do the umlauts change from "good to evil" when I store them in the dictionary?

  • How do I fix it?

  • Also, if anyone knows: what is a good resource to learn about encoding in Python? I run into issues all the time, and many things don't make sense to me.

I'm working on a Mac in Python 2.7.10. Thanks for your help!

patrick

2 Answers

3

Converting to Unicode is done by decoding your string (assuming you're getting bytes):

data = "haer ueber loess"
word = data.decode('utf-8')  # actual encoding depends on your data
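To illustrate what decoding changes, here is a minimal sketch. The byte value below is an assumed example (the UTF-8 encoding of "här"), not data from the question, and the code is written to run on both Python 2.7 and Python 3.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

# Assumed example input: the UTF-8 bytes for "här".
raw = b'h\xc3\xa4r'

# Decoding turns the byte string into a text (unicode) string.
text = raw.decode('utf-8')

print(len(raw))             # 4: the umlaut occupies two bytes
print(len(text))            # 3: the umlaut is a single character
print(text == u'h\xe4r')    # True
```

A byte string and a text string that look the same on screen can therefore differ in length and content, which is why both the input and the dictionary should be the same kind of string.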

Define your dict with unicode strings as well:

umlautdict={
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö'
    }

Finally, print umlautdict prints a representation of the dict in which non-ASCII characters appear as escape sequences. That is normal and nothing to worry about.
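Putting the pieces together might look like the sketch below. The helper name replace_umlauts and its encoding parameter are hypothetical, and the input is assumed to arrive as UTF-8 bytes; adjust the encoding to whatever your data actually uses.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

umlautdict = {
    u'ae': u'ä',
    u'ue': u'ü',
    u'oe': u'ö',
}

def replace_umlauts(raw, encoding='utf-8'):
    """Hypothetical helper: decode bytes, then apply every replacement."""
    word = raw.decode(encoding)
    for digraph, umlaut in umlautdict.items():
        # Reassign word so each replacement builds on the previous one.
        word = word.replace(digraph, umlaut)
    return word

result = replace_umlauts(b'haer ueber loess')
print(result == u'h\xe4r \xfcber l\xf6ss')   # True
```

The comparison uses escapes only so that the check does not depend on the terminal's encoding; the value itself holds the real umlaut characters.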

roeland
2
  1. Declare your coding.
  2. Use raw format for the special characters (optional: the r prefix only disables backslash escapes and does not change the encoding).
  3. Iterate properly on your string: keep the changes from each loop iteration as you head to the next.

Here's code to get the job done:

# -*- coding: utf-8 -*-

umlautdict = {
    'ae': r'ä',
    'ue': r'ü',
    'oe': r'ö'
    }

print umlautdict

inputword = "haer ueber loess"
for item in umlautdict.keys():
        inputword = inputword.replace(item, umlautdict[item])

print inputword

Output:

{'ue': '\xc3\xbc', 'oe': '\xc3\xb6', 'ae': '\xc3\xa4'}
här über löss
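
To see why step 3 matters, here is a small sketch contrasting the two loops. It uses unicode literals rather than the byte strings above so that it runs unchanged on Python 2.7 and Python 3; the variable name fixed is only for illustration.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function

umlautdict = {u'ae': u'ä', u'ue': u'ü', u'oe': u'ö'}
inputword = u'haer ueber loess'
expected = u'h\xe4r \xfcber l\xf6ss'

# Overwriting: every pass starts again from the untouched inputword,
# so only the replacement for the last key survives.
for item in umlautdict:
    outputword = inputword.replace(item, umlautdict[item])

# Accumulating: every pass works on the result of the previous one.
fixed = inputword
for item in umlautdict:
    fixed = fixed.replace(item, umlautdict[item])

print(outputword == expected)   # False
print(fixed == expected)        # True
```

If the last key happens not to occur in the word at all, the overwriting loop returns the input unchanged, which would look as if no replacement happened.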
Prune