
I have a dictionary that looks like this:

{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}

Now I'm trying to replace the \xf6 with ö, but trying .replace('\xf6', 'ö') returns an error:

UnicodeDecodeError: 'ascii' codec can't decode byte 0xf6 in position 0: ordinal not in range(128)

How can I fix this?

Jongware
Alex
  • This is most likely Python 2, but it would be nice of you to tag your question or mention which Python version you're using, because Python 2 and Python 3 differ quite a lot in the decoding department. The issue is that, by default, the unicode string is decoded using ASCII (since no encoding is declared in your script), and you're replacing ASCII with ASCII in your replace call... IIRC. – Torxed Jan 18 '17 at 15:19
  • How did you end up with an `\xf6` in a `u''` string to begin with…? – deceze Jan 18 '17 at 15:20
  • `u'\xf6'` is the very same thing as `u'ö'`: `len(u'\xf6')==1`; `u'\xf6' == u'ö'` –  Jan 18 '17 at 15:21
  • I am running v3, but does adding `u` in the replace method work? I.e. `.replace(u'\xf6', u'ö')` – James Jan 18 '17 at 15:24
  • @deceze Not sure, but I read it from the database and this was how it showed up. I think someone tried to put a JSON string in the database and didn't pay attention to special characters. – Alex Jan 18 '17 at 15:24
  • @James Yes. But they will be virtually the same as the original definition. – Torxed Jan 18 '17 at 15:25
  • Are the commenters here all mad? You don't need to replace anything! `.replace(u'\xf6', u'ö')` is a no-op. –  Jan 18 '17 at 15:26
  • @James no, because it is the same thing – Alex Jan 18 '17 at 15:30
  • read this: https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals –  Jan 18 '17 at 15:31
  • As well as checking out the relevant Python docs, please take a look at [Pragmatic Unicode](http://nedbatchelder.com/text/unipain.html) by SO veteran Ned Batchelder. – PM 2Ring Jan 18 '17 at 15:33
  • What OS are you using? And what encoding is your terminal using? What does `print u'Ganztags ge\xf6ffnet'` display in your terminal? Depending on your setup you may be able to print the Unicode object directly; otherwise you'll need to encode appropriately, eg `print u'Ganztags ge\xf6ffnet'.encode('utf8')` or `print u'Ganztags ge\xf6ffnet'.encode('latin1')`. – PM 2Ring Jan 18 '17 at 15:39
  • Mysterious quick-fixes for character encoding problems in Python are going to be unstable. It's important to really understand what's happening. A great resource for this is the half-hour talk "Pragmatic Unicode, or, How do I Stop the Pain?" https://www.youtube.com/watch?v=sgHbC6udIqc – Metropolis Jan 18 '17 at 15:44
  • A normal string in Python 2 is plain 7 bit ASCII, not a Unicode string, and it certainly doesn't contain chars like `ö` – PM 2Ring Jan 18 '17 at 15:49
  • @PM2Ring: As a non-native English speaker, I cannot agree with you. I have always used Latin1 strings in Python 2.7, and I have never found them *abnormal*. There **are** caveats around, but *displaying* unicode strings is not always simpler... – Serge Ballesta Jan 18 '17 at 17:04
  • @SergeBallesta Sure, you can put Latin-1 (aka ISO 8859-1) literals into a Python 2 script, but you need to give the script a `# -*- coding: latin-1 -*-` (or equivalent) coding directive. You can print such literals to the terminal if the terminal is set to use Latin-1, but if you try to decode them into Unicode you must specify the encoding or you get the dreaded "UnicodeDecodeError: 'ascii' codec can't decode byte [...] : ordinal not in range(128)" error. In my book, that qualifies Latin-1 literals as not normal, since you don't need to worry about any of those things with pure ASCII strings. – PM 2Ring Jan 19 '17 at 07:04
  • @PM2Ring: I assume your book was written in English :-). Beware, the `# -*- coding: latin-1 -*-` is far from a magic bullet. It is only used to insert unicode literals in a Python script, and ignored for any other purpose. In particular it does not change the default encoding. But it does document the encoding for future readers and some text editors. Honestly, I mainly use it for that last reason. – Serge Ballesta Jan 19 '17 at 07:13
  • @SergeBallesta Yes, I know that a coding directive is certainly _not_ a magic bullet - it only tells the interpreter how to decode the text of the script itself, it has no effect on how the script handles conversion to & from Unicode of any data it's processing; I wouldn't mind a dollar for every Unicode question I've seen on SO where the OP (or answerer) believes otherwise. ;) – PM 2Ring Jan 19 '17 at 07:25

2 Answers


Now encoding is a minefield, and I might be off on this one - please correct me if that's the case.

From what I've gathered over the years, Python 2 assumes ASCII unless you define an encoding at the top of your script - mainly because either it's compiled that way or the OS/terminal uses ASCII as its primary encoding.

With that said, what you see in your example data:

{ u'Samstag & Sonntag': u'Ganztags ge\xf6ffnet', u'Freitag': u'18:00 & 22:00'}

is the ASCII-safe representation of a unicode string. Somehow Python needs to tell you there's an ö in there - but it can't with ASCII, because ö has no representation in the ASCII table.

But when you try to replace it using:

x.replace('\xf6', 'ö')

You're trying to find an ASCII character/string `\xf6` that is outside the accepted byte range of ASCII, so that raises an exception - and you're trying to replace it with another invalid ASCII character, which raises the same exception.

Hence the "'ascii' codec can't decode byte..." message.

You can do unicode replacements like this:

a = u'Ganztags ge\xf6ffnet'
a.replace(u'\xf6', u'ö')

This will tell Python to find a unicode string and replace it with another unicode string.
But the output will be exactly the same as the input, because \xf6 *is* ö in unicode - the replace is a no-op.
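In Python 3 (which the question author says they're running), every `str` is already Unicode, so the same no-op applies - a quick sketch:

```python
# In Python 3 every str is Unicode, so '\xf6' and 'ö' are the
# same single code point (U+00F6), and the replace changes nothing.
a = 'Ganztags ge\xf6ffnet'
b = a.replace('\xf6', 'ö')  # a no-op

print('\xf6' == 'ö')  # True
print(a == b)         # True
```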

What you want to do, is encode your string into something you want to use, for instance - UTF-8:

a.encode('UTF-8')
'Ganztags ge\xc3\xb6ffnet'

And declare UTF-8 as the encoding of your source file by placing this at the top of your code (note that this only affects how literals in the script itself are decoded):

#!/usr/bin/python
# -*- coding: UTF-8 -*-

This should in theory make your application a little easier to work with,
and you can work with UTF-8 as your base model from then on.
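Applied to the dictionary from the question, that idea could look like this (Python 3 syntax; the variable names are just illustrative):

```python
# Encode every value of the example dict to UTF-8 bytes, e.g. for
# handing the data off to a system that expects byte strings.
data = {'Samstag & Sonntag': 'Ganztags ge\xf6ffnet',
        'Freitag': '18:00 & 22:00'}
encoded = {k: v.encode('utf-8') for k, v in data.items()}

# ö (U+00F6) becomes the two bytes 0xc3 0xb6 in UTF-8:
# encoded['Samstag & Sonntag'] == b'Ganztags ge\xc3\xb6ffnet'
```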

But there's no way that I know of to convert your representation into an ASCII ö, because there really is no such thing - ASCII simply has no ö. There are just different ways Python does this encoding magic for you, to make you believe it's possible to "just write ö".

In Python 3, most of the strings you encounter will either be bytes data or treated a bit differently from Python 2, and for the most part it's a lot easier.

There are numerous ways to change the encoding, though most are not standard practice.
The closest to "good" practice would be the locale:

import locale
locale.setlocale(locale.LC_ALL, 'sv_SE.UTF-8')
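If you only want to inspect what encoding the current locale implies (rather than changing it, which requires the named locale to be installed), `locale.getpreferredencoding()` is a safe, read-only check - a minimal sketch:

```python
import locale

# Report the encoding the current locale environment prefers;
# on most modern Linux systems this is 'UTF-8'.
enc = locale.getpreferredencoding()
print(enc)
```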

I also had a horrendous solution and approach to this years back (it was a great bodge for me at the time).

tl;dr:

Your code usually assumes/uses ASCII as its encoder/decoder. ö is not part of ASCII, therefore you'll always see \xf6 if you've somehow gotten unicode characters. Normally, if you print u'Ganztags ge\xf6ffnet' it will be shown as ö because of automatic encoding. If you need to verify whether input matches that string, you have to compare them as unicode: u'ö' == u'ö'. If other systems depend on this data, encode it with something they understand: .encode('UTF-8'). Replacing \xf6 with ö is the same thing - ö just doesn't exist in ASCII, so you need to write u'ö', which results in the same data in the end.

Torxed
  • Character `ö` has code `\xf6` in both unicode and the 8-bit Latin1 encoding, also known as iso-8859-1. Python has support for different 8-bit encodings, and Latin1 is common for west European languages, including German. – Serge Ballesta Jan 18 '17 at 16:54
  • @SergeBallesta That's the extended ASCII table you're talking about now, right? Because the default 128-character table (I think it is) doesn't. Python has support for roughly 30 or so encodings; the default, however, does not support `ö`. Latin1 != ASCII. – Torxed Jan 18 '17 at 16:56
  • I'm not speaking of any *extended ASCII* but of Python 8-bit strings. The `encode` and `decode` methods allow transforming **8-bit** strings from and to unicode respectively, in Python 2 and Python 3. The programmer simply must explicitly declare the charset. – Serge Ballesta Jan 18 '17 at 17:00
  • @SergeBallesta That's exactly the point I tried to get across with encode and decode. However, standard Python strings are 7-bit if I'm not mistaken? – Torxed Jan 18 '17 at 18:07
  • Python 2 strings are *byte* strings, that is, 8-bit strings. It's simply that the default encoding is 7-bit ASCII. – Serge Ballesta Jan 18 '17 at 22:08
  • And in 7-bit ASCII, there is no such thing as `ö`. Assuming LSB, `2^7=128`. The value of `ö` in this case is `246`. Therefore it's not possible with the default encoding to parse `ö` - this is what I'm trying to get across here. I'm not sure if you're trying to enlighten others reading this or if you're trying to lecture me that I'm wrong somehow. Either way, please make it clear. I'm assuming the latter, and if that's the case, you're wrong (I think). Python 2 strings are byte string literals at run-time, yes, and in Python 3 they're Unicode with UTF-8 as the default for basically everything. – Torxed Jan 18 '17 at 22:19

As you are using the German language, you should be aware of non-ASCII characters. You should know whether your system prefers Latin1 (Windows console and some Unixes), UTF8 (most Linux variants), or native unicode (Windows GUI).

If you can process everything as native unicode, things are cleaner, and you should just accept the fact that u'ö' and u'\xf6' are the same character - the latter is simply independent of the Python source file charset.

If you have to output byte strings or store them in files, you should encode them in UTF8 (which can represent any unicode character, but characters with code points above 127 use more than one byte) or Latin1 (one byte per character, but it only supports code points below 256).

In that case just use an explicit encoding to convert your unicode strings to byte strings:

print u'Ganztags ge\xf6ffnet'.encode('Latin1') # or .encode('utf8')

should give what you expect.
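To see the difference between the two encodings side by side (Python 3 syntax below; the resulting byte values are the same in Python 2):

```python
s = 'Ganztags ge\xf6ffnet'  # ö is code point U+00F6

latin1 = s.encode('latin1')  # one byte per character
utf8 = s.encode('utf8')      # ö becomes the two bytes 0xc3 0xb6

# latin1 == b'Ganztags ge\xf6ffnet'
# utf8   == b'Ganztags ge\xc3\xb6ffnet'
```

Which one "gives what you expect" on screen depends entirely on which encoding your terminal is set to decode.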

Serge Ballesta