
I am working against an application that seems keen on returning what I believe to be double-UTF-8-encoded strings.

I send the string u'XüYß' (that is, u'X\u00fcY\u00df') encoded using UTF-8, so it goes out on the wire as X\xc3\xbcY\xc3\x9f.

The server should simply echo what I sent it, yet returns the following: X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f (it should be X\xc3\xbcY\xc3\x9f). If I decode it using str.decode('utf-8'), it becomes u'X\xc3\xbcY\xc3\x9f', which looks like a ... unicode string containing the original string encoded using UTF-8.

But Python won't let me decode a unicode string without re-encoding it first - which fails for some reason that escapes me:

>>> ret = 'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'.decode('utf-8')
>>> ret
u'X\xc3\xbcY\xc3\x9f'
>>> ret.decode('utf-8')
# Throws UnicodeEncodeError: 'ascii' codec can't encode ...

How do I persuade Python to re-decode the string? And is there any practical way of debugging what's actually in the strings, without passing them through all the implicit conversions print uses?
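For reference, the closest I've come to inspecting the data is looking at repr() instead of print()ing - repr shows the escaped byte/character values without any implicit encoding step (sketched here with Python 3 bytes syntax for clarity):

```python
# Inspect the raw data with repr() instead of print(): no implicit conversion.
raw = b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'   # what the server sends back
once = raw.decode('utf-8')                    # one decode: still looks wrong

print(repr(raw))    # b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'
print(repr(once))   # 'XÃ¼YÃ\x9f' - the original UTF-8 bytes, as code points
```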

(And yes, I have reported this behaviour with the developers of the server-side.)

Morten Siebuhr

4 Answers


ret.decode() implicitly tries to encode ret with the system encoding first - in your case, ascii.
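In other words, under Python 2 calling .decode() on a unicode object roughly amounts to ret.encode(sys.getdefaultencoding()).decode(...), and the hidden ascii encode is the step that fails. A sketch of just that step (this also runs as-is on Python 3):

```python
ret = u'X\xc3\xbcY\xc3\x9f'   # already a unicode string, not bytes

# The implicit step hidden inside Python 2's ret.decode('utf-8'):
try:
    ret.encode('ascii')       # the system default encoding is ascii
except UnicodeEncodeError as exc:
    print(exc)                # 'ascii' codec can't encode character ...
```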

If you explicitly encode the unicode string, you should be fine. There is a builtin encoding that does what you need:

>>> 'X\xc3\xbcY\xc3\x9f'.encode('raw_unicode_escape').decode('utf-8')
'XüYß'

Really, .encode('latin1') (or cp1252) would be OK, because that's what the server is almost certainly using. The raw_unicode_escape codec will just give you something recognizable at the end instead of raising an exception:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'

>>> '€\xe2\x82\xac'.encode('latin1').decode('utf8')
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'latin-1' codec can't encode character '\u20ac' in position 0: ordinal not in range(256)

In case you run into this sort of mixed data, you can use the codec again to normalize everything:

>>> '€\xe2\x82\xac'.encode('raw_unicode_escape').decode('utf8')
'\\u20ac€'

>>> '\\u20ac€'.encode('raw_unicode_escape')
b'\\u20ac\\u20ac'
>>> '\\u20ac€'.encode('raw_unicode_escape').decode('raw_unicode_escape')
'€€'
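Wrapped up as a small helper (the function name is mine, not from the standard library), the fix for the question's data looks like this:

```python
def fix_double_utf8(text):
    # Map each code point < 256 back to the byte it came from,
    # then decode those bytes as the UTF-8 they originally were.
    return text.encode('raw_unicode_escape').decode('utf8')

mojibake = 'X\xc3\xbcY\xc3\x9f'   # the once-decoded server reply
print(fix_double_utf8(mojibake))  # XüYß
```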

What you want is the encoding where Unicode code point X is encoded to the same byte value X. For code points inside 0-255 you have this in the latin-1 encoding:

def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")
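For example, applied to the server reply from the question (shown here with Python 3 bytes syntax):

```python
def double_decode(bstr):
    return bstr.decode("utf-8").encode("latin-1").decode("utf-8")

# The doubly UTF-8-encoded reply from the question, as raw bytes:
reply = b'X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f'
print(double_decode(reply))   # XüYß
```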
u0b34a0f6ae

Don't use this! Use @hop's solution.

My nasty hack: (cringe! but quietly. It's not my fault, it's the server developers' fault)

def double_decode_unicode(s, encoding='utf-8'):
    return ''.join(chr(ord(c)) for c in s.decode(encoding)).decode(encoding)

Then,

>>> double_decode_unicode('X\xc3\x83\xc2\xbcY\xc3\x83\xc2\x9f')
u'X\xfcY\xdf'
>>> print _
XüYß
Chris Morgan
  • Great question, by the way. A nasty situation. I hope someone else can come up with a neater solution than `chr(ord(c))` to convert unicode to str, character by character... – Chris Morgan Nov 24 '10 at 13:30
  • `f(char) for char in string` cries for an encoding. –  Nov 24 '10 at 13:33
  • transforming each character of string in sequence via some function is the very definition of encoding and decoding, that's how. –  Nov 24 '10 at 13:44
  • @hop: naturally, but as a solution this looks ghastly. Your `.encode('raw_unicode_escape')` is much cleaner (quite aside from the fact that the unicode->str step of your solution is over six times as fast as mine). – Chris Morgan Nov 24 '10 at 13:52

Here's a little script that might help you, doubledecode.py -- https://gist.github.com/1282752

s29