8

I have a problem with converting uppercase letters with umlauts to lowercase ones.

print("ÄÖÜAOU".lower())

The A, O and U get converted properly, but the Ä, Ö and Ü stay uppercase. Any ideas?

The first problem is fixed with .decode('utf-8'), but I still have a second one:

# -*- coding: utf-8 -*-
original_message="ÄÜ".decode('utf-8')
original_message=original_message.lower()
original_message=original_message.replace("ä", "x")
print(original_message)

Traceback (most recent call last):
  File "Untitled.py", line 4, in <module>
    original_message=original_message.replace("ä", "x")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

smci
user2104634

3 Answers

9

You'll need to mark it as a unicode string unless you're working with plain ASCII:

> print(u"ÄÖÜAOU".lower())

äöüaou

It works the same way with variables; it all depends on the type of the value assigned to the variable to begin with.

> olle = "ÅÄÖABC"
> print(olle.lower())
ÅÄÖabc

> olle = u"ÅÄÖABC"
> print(olle.lower())
åäöabc
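
To make the difference visible without depending on the terminal or file encoding, here is a minimal sketch that spells out both kinds of string explicitly. It assumes the source file is UTF-8, so a plain "ÅÄÖABC" literal in Python 2 holds the bytes written below; the variable names are only for illustration. The b"" prefix makes it behave the same on Python 3.

```python
# UTF-8 bytes of "ÅÄÖABC": what a plain "..." literal holds in Python 2
# when the source file is saved as UTF-8 (assumption).
olle_bytes = b"\xc3\x85\xc3\x84\xc3\x96ABC"

# The same text as a unicode string (Å, Ä, Ö written as escapes).
olle_text = u"\xc5\xc4\xd6ABC"

# On a byte string, lower() only touches the ASCII letters A-Z.
lowered_bytes = olle_bytes.lower()

# On a unicode string, lower() knows about Å, Ä and Ö as well.
lowered_text = olle_text.lower()

print(lowered_bytes == b"\xc3\x85\xc3\x84\xc3\x96abc")  # True: umlaut bytes unchanged
print(lowered_text == u"\xe5\xe4\xf6abc")               # True: this is u"åäöabc"
print(len(olle_bytes))  # 9 bytes
print(len(olle_text))   # 6 characters
```

The length difference shows why the byte version cannot work: each umlaut is two bytes, and neither byte is an ASCII letter.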
Joachim Isaksson
  • I have # -*- coding: utf-8 -*- in the first line; it looks like it's the Python version, as BlaXpirit suggests. – user2104634 Feb 24 '13 at 14:50
  • @user2104634 The above example was run on standard Python 2.7.2 on Mac OS X. Without the `u` marker, only ASCII characters are converted to lower case; with it, the output is correct. – Joachim Isaksson Feb 24 '13 at 14:51
  • So the tag in the beginning is not enough? – user2104634 Feb 24 '13 at 14:54
  • The tag just tells Python the encoding of the file. – Matthias Feb 24 '13 at 14:58
  • @user2104634 Just as Matthias says, the coding metadata only helps Python correctly detect the encoding of the file; it has nothing to do with ASCII versus unicode strings at runtime. – Joachim Isaksson Feb 24 '13 at 14:59
  • @user2104634 If original_message contains a unicode string, yes, it will work just fine. Added an example to the answer. – Joachim Isaksson Feb 24 '13 at 15:07
  • The problem is that the variable comes from a raw_input. – user2104634 Feb 24 '13 at 15:10
  • It works until the script reaches the point where it should replace characters. – user2104634 Feb 24 '13 at 15:14
  • @user2104634 If you're doing raw_input from stdin, you can get it as a unicode string using `olle=raw_input().decode(sys.stdin.encoding)` instead of just `olle=raw_input()`. – Joachim Isaksson Feb 24 '13 at 15:15
  • As said, if I do that I get an error in the replace part of the script: File "KORO.py", line 46, in replace c("ä", "335") File "KORO.py", line 200, in c original_message=original_message.replace(letter, number) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – user2104634 Feb 24 '13 at 15:18
  • Added an example to the question. – user2104634 Feb 24 '13 at 15:27
  • @user2104634 Regarding your addition, you need to replace using unicode strings too; `original_message=original_message.replace(u"ä", u"x")` works well. – Joachim Isaksson Feb 24 '13 at 15:38
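
Putting the comments above together: in Python 2, calling .replace() on a unicode string with byte-string arguments makes Python decode those arguments implicitly with the ASCII codec, which is where the UnicodeDecodeError on byte 0xc3 comes from. A minimal sketch of the whole flow follows. The names `stdin_encoding` and `raw` are stand-ins (assumptions) for `sys.stdin.encoding` and the result of `raw_input()`, so that the snippet runs without a terminal.

```python
# Assumption: stands in for sys.stdin.encoding on a UTF-8 terminal.
stdin_encoding = "utf-8"

# Assumption: the bytes raw_input() would return in Python 2
# if the user typed "ÄÜ" on such a terminal.
raw = b"\xc3\x84\xc3\x9c"

# Decode once, at the boundary, so everything after this is unicode.
original_message = raw.decode(stdin_encoding)
original_message = original_message.lower()

# Both arguments are unicode too, so nothing is decoded implicitly.
# u"\xe4" is the same as u"ä" in a file with a UTF-8 coding declaration.
original_message = original_message.replace(u"\xe4", u"x")

print(original_message == u"x\xfc")  # True: the result is u"xü"
```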
3

You are dealing with encoded strings, not with unicode text.

The .lower() method of byte strings can only deal with ASCII values. Decode your string to Unicode or use a unicode literal (u''), then lowercase:

>>> print u"\xc4AOU".lower()
äaou
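
If the input may arrive either as encoded bytes or as unicode text, the "decode, then lowercase" step can be wrapped in a small helper. This is only a sketch: `lower_text` is a hypothetical name, and the UTF-8 default is an assumption about where the bytes come from.

```python
def lower_text(value, encoding="utf-8"):
    """Return value lowercased as unicode text.

    Hypothetical helper: byte strings are decoded first (UTF-8 is an
    assumed default), unicode strings are lowercased as they are.
    """
    if isinstance(value, bytes):  # bytes is an alias of str in Python 2
        value = value.decode(encoding)
    return value.lower()


# b"\xc3\x84AOU" is "ÄAOU" encoded as UTF-8; u"\xc4AOU" is the same text.
print(lower_text(b"\xc3\x84AOU") == u"\xe4aou")  # True
print(lower_text(u"\xc4AOU") == u"\xe4aou")      # True
```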
Martijn Pieters
  • @user2104634: you need to read the [Python Unicode HOWTO](http://docs.python.org/2/howto/unicode.html); you decode the variable to a `unicode` value (`variable.decode(encoding)`). – Martijn Pieters Feb 24 '13 at 15:00
2

If you're using Python 2 but don't want to prefix all your strings with u"", put this at the beginning of your program:

# -*- coding: utf-8 -*-
from __future__ import unicode_literals
olle = "ÅÄÖABC"  # now a unicode string, even without the u prefix
print(olle.lower())

This will now print:

åäöabc

The encoding declaration specifies how to interpret the characters read in from disk into a program, while the from __future__ import statement tells Python how to interpret string literals within the program itself. You will probably need both.