Umlaut in raw_input()

Question

I am currently learning Python and I came across the following code:

text=raw_input()
for letter in text:
    x=[alpha_dic[letter]]
    print x

When I write an umlaut (in the dictionary by the way) it gives me an error like -KeyError: '\xfc'- (for ü in this case) because the umlauts are saved internally in this way! I saw some solutions with unicode encoding or utf but either I am not skilled enough to apply it correctly or maybe it simply does not work that way.

What do you want to display for a letter that isn't in `alpha_dic`? Understanding what you're trying to accomplish with this code would also help in solving this problem. E.g. if you don't mind returning some default value for missing letters, then `alpha_dic.get(letter, default)` might be an option... — mgilson, Apr 11 '16 at 16:16

score 0 · Answer 1 · edited May 23 '17 at 11:59

I got this to work borrowing from this answer:

# -*- coding: utf-8 -*-
import sys, locale
alpha_dict = {u"ü":"umlaut"}
text= raw_input().decode(sys.stdin.encoding or locale.getpreferredencoding(True))
for letter in text:
    x=[alpha_dict[unicode(letter)]]
    print x

>>> ü
>>> ['umlaut']

Python 2 and unicode are not for the faint of heart...

answered Apr 11 '16 at 16:22

RickyA

score 0 · Accepted Answer · answered Apr 11 '16 at 17:36

Your trouble comes from several shortcomings of Python 2.x:

raw_input() gives you raw bytes from the system, with no encoding information attached.
The default encoding Python 2 uses when it converts byte strings implicitly is 'ascii', which cannot represent 'ü'.
The encoding of the literals in your script is either ASCII or has to be declared in a header at the top of the file.

So if you have a simple file like this:

x = {'ü': 20, 'ä': 10}

and run it with Python, you get an error, because the encoding is unknown:

SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details

This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.

For example, if the encoding is CP1252 (like a German Windows GUI):

# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä': 30}
print repr(x)

This prints:

{u'\xfc': 20, u'\xe4': 30}

But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:

{u'\xb3': 20, u'\xf5': 30}

Totally different.

So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
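The effect of a wrong header can be reproduced without changing any editor settings by decoding the same two bytes under both code pages. This is a minimal sketch; it uses explicit byte literals and made-up variable names (both assumptions, chosen so that it runs the same way on Python 2.6+ and Python 3).

```python
# The two bytes an editor saving as CP1252 writes for u-umlaut and a-umlaut.
raw = b'\xfc\xe4'

# Decoded with the matching code page: the intended letters.
as_cp1252 = raw.decode('cp1252')

# Decoded with the wrong code page: entirely different characters.
as_cp850 = raw.decode('cp850')

# Compare against the code points shown in the two repr() outputs above.
print(as_cp1252 == u'\xfc\xe4')
print(as_cp850 == u'\xb3\xf5')
```

Both comparisons hold, which matches the two different dictionaries printed above: the bytes in the file never changed, only the interpretation did.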

The next step is fixing raw_input(). It does what it says: it gives you the raw input from the console, just bytes. But a 'ü' can be represented by many different byte sequences: 0xfc in ISO-8859-1 and CP1252, 0x81 in CP850, 0xc3 + 0xbc in UTF-8, 0x00 + 0xfc or 0xfc + 0x00 in UTF-16, and so on.
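A quick way to check which bytes a given encoding uses is to encode the character and look at the result. The sketch below does this for a handful of codec names from Python's standard codec registry; the list of encodings and the variable names are chosen here for illustration only.

```python
# u-umlaut as a unicode string (escape used so the file stays pure ASCII).
u_umlaut = u'\xfc'

# A few encodings a console or file might be using.
encodings = ['latin-1', 'cp1252', 'cp850', 'utf-8', 'utf-16-be', 'utf-16-le']

# Map each encoding name to the byte sequence it produces for the character.
encoded = {}
for name in encodings:
    encoded[name] = u_umlaut.encode(name)

for name in encodings:
    print('%s: %r' % (name, encoded[name]))
```

Note that CP850, the usual code page of a Western European Windows console, uses a different byte than ISO-8859-1 and CP1252, which is why console input and file contents on the same machine can disagree.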

So your code has two issues with that:

for letter in text:

If text happens to be a plain byte string in a multibyte encoding (e.g. UTF-8, UTF-16, and some others), one byte does not equal one letter, so iterating over the string like that will not do what you expect. For a very simplified view of a letter, you might get away with that kind of iteration over Python unicode strings (if they are properly normalized). So you need to make sure text is a unicode string first.
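The difference between bytes, code points, and normalized text can be illustrated with a short sketch. It assumes UTF-8 input and uses NFC normalization from the standard unicodedata module as one possible normalization form; the variable names are illustrative.

```python
import unicodedata

# u-umlaut as it arrives from a UTF-8 console: two bytes for one letter.
utf8_bytes = b'\xc3\xbc'
byte_count = len(utf8_bytes)

# After decoding, the same letter is a single code point.
text = utf8_bytes.decode('utf-8')
char_count = len(text)

# The same letter can also be written as 'u' plus a combining diaeresis,
# which is two code points until it is normalized.
decomposed = u'u\u0308'
composed = unicodedata.normalize('NFC', decomposed)

print(byte_count, char_count, len(decomposed), len(composed))
```

Only the decoded and normalized form yields exactly one item per letter when iterated, which is what a dictionary lookup per letter relies on.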

How do you convert a byte string to unicode? A byte string offers the decode() method, which takes an encoding. A good first guess for that encoding is the expression sys.stdin.encoding or locale.getpreferredencoding(True).
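That guess can be wrapped in a small helper so the decoding step is easy to test with a known encoding. The function names guess_console_encoding and to_unicode are hypothetical, not part of any library, and the getattr guard is an added assumption for environments where stdin has been replaced by an object without an encoding attribute.

```python
import locale
import sys


def guess_console_encoding():
    """Return the console encoding, falling back to the locale's preference."""
    # sys.stdin.encoding can be None (e.g. when input is piped in Python 2).
    stdin_encoding = getattr(sys.stdin, 'encoding', None)
    return stdin_encoding or locale.getpreferredencoding(True)


def to_unicode(raw, encoding=None):
    """Decode a byte string, using the guessed encoding unless one is given."""
    return raw.decode(encoding or guess_console_encoding())
```

On Python 2, a call such as to_unicode(raw_input()) would then return a unicode string; passing the encoding explicitly, as in to_unicode(raw, 'cp850'), overrides the guess when it turns out to be wrong.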

Putting things together:

import sys, locale

alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over the unicode string (code points, not really letters...)
for letter in utext:
    x = [alpha_dict[letter]]
    print x

Thanks, your answer helped me greatly but solved only one side of the problem. The umlauts are now recognized by Python, but how do I print them? Like `alpha_dict = {key: 'ü'}` @RickyA — Cetarius, Apr 12 '16 at 18:59
Depends on your environment. If you experiment inside `IDLE`, you can just print it and it works. In a console on Windows it fails unless you mess around with encodings. On Linux with a UTF-8 console it simply works too. — schlenk, Apr 12 '16 at 19:40
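One way to make printing less dependent on the environment is to encode the unicode value explicitly before writing it out. The following is only a sketch: encode_for_console is a hypothetical helper, the UTF-8 fallback is an assumption, and the 'replace' error handler is one possible choice that trades a '?' for a crash when the console cannot represent a character.

```python
import sys


def encode_for_console(text, encoding=None):
    """Encode a unicode string for output, replacing unencodable characters."""
    # sys.stdout.encoding can be None when output is redirected in Python 2.
    target = encoding or getattr(sys.stdout, 'encoding', None) or 'utf-8'
    return text.encode(target, 'replace')
```

In Python 2 the result is a plain byte string, so it can be passed directly to the print statement, for example print encode_for_console(alpha_dict[letter]).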
