python: lower() german umlauts

Question

I have a problem with converting uppercase letters with umlauts to lowercase ones.

print("ÄÖÜAOU".lower())

The A, O and U get converted properly, but the Ä, Ö and Ü stay uppercase. Any ideas?

The first problem is fixed with .decode('utf-8'), but I still have a second one:

# -*- coding: utf-8 -*-
original_message="ÄÜ".decode('utf-8')
original_message=original_message.lower()
original_message=original_message.replace("ä", "x")
print(original_message)

Traceback (most recent call last):
  File "Untitled.py", line 4, in <module>
    original_message=original_message.replace("ä", "x")
UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128)

Are you using Python 2 or 3? – Martijn Pieters Feb 24 '13 at 14:46
Python 2.7.2, the one shipped with OS X. – user2104634 Feb 24 '13 at 14:46
@user2104634 There's your problem. – Oleh Prypin Feb 24 '13 at 14:46

Joachim Isaksson · Accepted Answer · 2013-02-24T15:07:06.867

9

You'll need to mark it as a unicode string unless you're working with plain ASCII:

> print(u"ÄÖÜAOU".lower())

äöüaou

It works the same way with variables; it all depends on the type assigned to the variable to begin with.

> olle = "ÅÄÖABC"
> print(olle.lower())
ÅÄÖabc

> olle = u"ÅÄÖABC"
> print(olle.lower())
åäöabc
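To make the difference visible without relying on the terminal, here is a minimal sketch that compares the two types directly. It is written to run on Python 2.7 as well as Python 3.3+; the variable names are only illustrative, and the byte string is built with .encode("utf-8") on the assumption that the source file and terminal use UTF-8.

```python
# -*- coding: utf-8 -*-
text = u"ÄÖÜAOU"              # Unicode text
raw = text.encode("utf-8")     # the same characters as UTF-8 encoded bytes

lowered_text = text.lower()    # Unicode-aware: every letter is lowercased
lowered_raw = raw.lower()      # byte strings only lowercase the ASCII bytes A-Z

print(lowered_text == u"äöüaou")                 # True
print(lowered_raw.decode("utf-8") == u"ÄÖÜaou")  # True: umlauts untouched
```

The umlauts survive in the byte string because each one is stored as two bytes above 0x7F, and none of those bytes is an ASCII uppercase letter.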

edited Feb 24 '13 at 15:07

answered Feb 24 '13 at 14:47

Joachim Isaksson


I have # -*- coding: utf-8 -*- in the first line; looks like it's the Python version, as BlaXpirit suggests. – user2104634 Feb 24 '13 at 14:50
@user2104634 The above example was run on standard Python 2.7.2 on Mac OS X. Without marking the string as unicode, it will only convert ASCII characters to lower case; with the `u` marker, it gives the correct output. – Joachim Isaksson Feb 24 '13 at 14:51
So the tag in the beginning is not enough? – user2104634 Feb 24 '13 at 14:54
The tag just tells Python the encoding of the file. – Matthias Feb 24 '13 at 14:58
@user2104634 Just as Matthias says, the coding metadata only helps Python to correctly detect the encoding of the file; it has nothing to do with ASCII versus unicode strings at runtime. – Joachim Isaksson Feb 24 '13 at 14:59
@user2104634 If original_message contains a unicode string, yes, it will work just fine. Added an example to the answer. – Joachim Isaksson Feb 24 '13 at 15:07
Problem is the variable comes from a raw_input – user2104634 Feb 24 '13 at 15:10
It does work until the script hits a point, where it should replace characters. – user2104634 Feb 24 '13 at 15:14
@user2104634 If you're doing raw_input from stdin, you can get it as a unicode string using `olle=raw_input().decode(sys.stdin.encoding)` instead of just `olle=raw_input()`. – Joachim Isaksson Feb 24 '13 at 15:15
As said, if I do that I get an error in the replace part of the script: File "KORO.py", line 46, in replace c("ä", "335") File "KORO.py", line 200, in c original_message=original_message.replace(letter, number) UnicodeDecodeError: 'ascii' codec can't decode byte 0xc3 in position 0: ordinal not in range(128) – user2104634 Feb 24 '13 at 15:18
added an example to the question. – user2104634 Feb 24 '13 at 15:27
@user2104634 Regarding your addition, you need to replace using unicode strings too; `original_message=original_message.replace(u"ä", u"x")` works well. – Joachim Isaksson Feb 24 '13 at 15:38
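Putting the comments above together: in Python 2, a plain "ä" is the two-byte string \xc3\xa4, and passing it to the replace() method of a unicode value makes Python decode it implicitly as ASCII, which fails at byte 0xc3. A rough sketch of the decode-then-replace flow follows. The to_text helper is hypothetical (not part of any library), and the input is simulated with UTF-8 bytes instead of a real raw_input() call so the snippet runs on both Python 2.7 and 3.3+.

```python
# -*- coding: utf-8 -*-
import sys

def to_text(value, encoding=None):
    """Hypothetical helper: decode byte strings, pass Unicode text through."""
    if isinstance(value, bytes):  # on Python 2, bytes is the plain str type
        # Fall back to UTF-8 (an assumption) if no encoding is known.
        return value.decode(encoding or sys.stdin.encoding or "utf-8")
    return value

# Stand-in for raw_input(): the UTF-8 bytes a terminal would deliver for "ÄÜ".
typed = u"ÄÜ".encode("utf-8")

original_message = to_text(typed, "utf-8").lower()
# Both arguments are Unicode, so no implicit ASCII decoding takes place.
original_message = original_message.replace(u"ä", u"x")

print(original_message == u"xü")  # True
```

With real input you would pass the result of raw_input() to the helper; note that sys.stdin.encoding can be None when input is piped, which is why the sketch falls back to UTF-8.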

score 3 · Answer 2 · answered Feb 24 '13 at 14:48

You are dealing with encoded strings, not with unicode text.

The .lower() method of byte strings can only deal with ASCII values. Decode your string to Unicode or use a unicode literal (u''), then lowercase:

>>> print u"\xc4AOU".lower()
äaou
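For the "decode your string" route, a small sketch (assuming the bytes are UTF-8, which the question's use of .decode('utf-8') suggests) could look like this. Note that \xc4 inside a u'' literal is the code point U+00C4 (Ä), whereas the UTF-8 encoded form of the same character is the two bytes 0xC3 0x84. The snippet uses only escapes, so it runs on Python 2.7 and 3.3+ without a coding declaration.

```python
raw = b"\xc3\x84AOU"        # UTF-8 bytes: 0xC3 0x84 encode the single character U+00C4
text = raw.decode("utf-8")  # now a Unicode string of four characters
lowered = text.lower()

print(text == u"\xc4AOU")     # True
print(lowered == u"\xe4aou")  # True: U+00E4 is the lowercase a-umlaut
print(len(raw) == 5)          # True: five bytes
print(len(text) == 4)         # True: four characters
```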

answered Feb 24 '13 at 14:48

Martijn Pieters


@user2104634: you need to read the [Python Unicode HOWTO](http://docs.python.org/2/howto/unicode.html); you decode the variable to a `unicode` value (`variable.decode(encoding)`). – Martijn Pieters Feb 24 '13 at 15:00

score 2 · Answer 3 · answered Feb 24 '13 at 16:00

If you're using Python 2 but don't want to prefix u"" on all your strings, put this at the beginning of your program:

from __future__ import unicode_literals
olle = "ÅÄÖABC"
print(olle.lower())

This will now print:

åäöabc

The encoding declaration specifies how to interpret the characters read in from disk into a program, while the from __future__ import statement tells Python how to interpret string literals within the program itself. You will probably need both.

answered Feb 24 '13 at 16:00

Michael Scott Asato Cuthbert


Today, my suggestion would be: use Python 3. unicode_literals doesn't work in enough places to be worth it. – Michael Scott Asato Cuthbert Nov 08 '18 at 18:14
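For comparison, a quick sketch of what that looks like on Python 3 (this one is Python 3 only): every plain string literal is already Unicode text, so neither the u prefix nor the __future__ import is needed, and mixing text with bytes fails loudly with a TypeError instead of triggering an implicit ASCII decode.

```python
olle = "ÅÄÖABC"              # str is Unicode text on Python 3
lowered = olle.lower()
print(lowered == "åäöabc")   # True

try:
    # Passing bytes to str.replace() is rejected outright on Python 3.
    lowered.replace("ä".encode("utf-8"), b"x")
    mixing_failed = False
except TypeError:
    mixing_failed = True

print(mixing_failed)         # True
```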
