You get some trouble from multiple shortcomings in Python (2.x).
raw_input()
gives you raw bytes from the system with no encoding info
- Native encoding for python strings is 'ascii', which cannot represent 'ü'
- The encoding of the literal in your script is either ascii or needs to be declared in a header at the top of the file
So if you have a simple file like this:
x = {'ü': 20, 'ä': 10}
And run it with python you get an error, because the encoding is unknown:
SyntaxError: Non-ASCII character '\xfc' in file foo.py on line 1, but no encoding declared;
see http://python.org/dev/peps/pep-0263/ for details
This can be fixed, of course, by adding an encoding header to the file and turning the literals into unicode literals.
For example, if the encoding is CP1252 (like a German Windows GUI):
# -*- coding: cp1252 -*-
x = {u'ü': 20, u'ä':30}
print repr(x)
This prints:
{u'\xfc': 20, u'\xe4': 30}
But if you get the header wrong (e.g. write CP850 instead of CP1252, but keep the same content), it prints:
{u'\xb3': 20, u'\xf5': 30}
Totally different.
So first check that your editor settings match the encoding header in your file, otherwise all non-ascii literals will simply be wrong.
Next step is fixing raw_input()
. It does what it says it does, providing you raw input from the console. Just bytes. But an 'ü' can be represented with a lot of different bytes 0xfc
for ISO-8859-1, CP1252, CP850 etc., 0xc3
+ 0xbc
in UTF-8, 0x00
+ 0xfc
or 0xfc
+ 0x00
in UTF-16, and so on.
So your code has two issues with that:
for letter in text:
If text
happens to be a simple byte string in a multibyte encoding (e.g UTF-8, UTF-16, some others), one-byte is not equal to one letter, so iterating like that over the string will not do what you expect. For a very simplified view of letter
you might be able to do that kind of iteration with the python unicode strings (if properly normalized). So you need to make sure text
is a unicode string first.
How to convert from a byte string to unicode? A bytestring offers the decode()
method, which takes an encoding. A good first guess for that encoding is the piece of code here sys.stdin.encoding or locale.getpreferredencoding(True)
)
Putting things together:
alpha_dict = {u'\xfc': u'small umlaut u'}
text = raw_input()
# turn text into unicode
utext = text.decode(sys.stdin.encoding or locale.getpreferredencoding(True))
# iterate over unicode string, not really letters...
for letter in utext:
x=[alpha_dic[letter]]
print x