
I've got a problem with strings that I get from one of my clients over XML-RPC. He sends me UTF-8 strings that are encoded twice :( so when I get them in Python I have a unicode object that has to be decoded one more time, but obviously Python doesn't allow that. I've notified my client, but I need a quick workaround for now until he fixes it.

Raw string from tcp dump:

<string>Rafa\xc3\x85\xc2\x82</string>

this is converted into:

u'Rafa\xc5\x82'

The best I've come up with is:

eval(repr(u'Rafa\xc5\x82')[1:]).decode("utf8") 

This results in correct string which is:

u'Rafa\u0142' 

This works, but it is ugly as hell and cannot be used in production code. If anyone knows a more suitable way to fix this problem, please write. Thanks, Chris
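For context, here is a sketch (in Python 3 terms, though the thread itself is Python 2) of how the double encoding most likely happened on the client's side: the correct UTF-8 bytes were misread as Latin-1 text and encoded to UTF-8 a second time. The exact cause on the client is an assumption; this just reproduces the bytes seen in the dump.

```python
# Suspected client-side bug, sketched in Python 3:
# UTF-8 bytes get misread as Latin-1 and encoded to UTF-8 again.
correct = "Rafa\u0142"                           # u'Rafał'
once = correct.encode("utf-8")                   # b'Rafa\xc5\x82'
twice = once.decode("latin-1").encode("utf-8")   # b'Rafa\xc3\x85\xc2\x82'
print(twice)  # matches the bytes seen in the tcpdump
```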

Alan Moore
Chris Ciesielski

3 Answers

>>> s = u'Rafa\xc5\x82'
>>> s.encode('raw_unicode_escape').decode('utf-8')
u'Rafa\u0142'
>>>
Ivan Baldin
  • @partisann: Neat! I didn't know about raw_unicode_escape (obviously 8-) – RichieHindle Jul 24 '09 at 13:17
  • Thanks partisann, I hadn't known about it either. – Chris Ciesielski Jul 27 '09 at 09:10
  • May your reputation rise beyond expectation, even after all those years! :) – Marian Apr 22 '13 at 17:18
  • Seems you're not answering the question: you don't start from a doubly encoded UTF-8 string, and it fails with the Euro symbol: python -c 'import sys; print sys.argv[1].encode("raw_unicode_escape")' $'\xc3\xa2\xc2\x82\xc2\xac' gives ordinal not in range(128) – Julien Palard Jun 03 '14 at 15:29
  • @JulienPalard In Python 2.x you have to manually decode the `str` object to get a unicode string. Fix for 2.x: `[etc.]argv[1].decode("utf-8").encode("raw_[etc.]`. In 3.x `str` is already unicode and the interpreter automatically decodes it from the system's default encoding. Fix for 3.x: parentheses around print, run with python3. – Ivan Baldin Jun 16 '14 at 14:03
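The same fix carries over to Python 3 almost unchanged; a sketch (in 3.x the literal is already a str/unicode object, as the comment above notes):

```python
# Python 3 version of the raw_unicode_escape round trip.
s = "Rafa\xc5\x82"                               # mojibake: U+00C5 U+0082
fixed = s.encode("raw_unicode_escape").decode("utf-8")
print(fixed)  # Rafał, i.e. 'Rafa\u0142'
```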

Yow, that was fun!

>>> original = "Rafa\xc3\x85\xc2\x82"
>>> first_decode = original.decode('utf-8')
>>> as_chars = ''.join([chr(ord(x)) for x in first_decode])
>>> result = as_chars.decode('utf-8')
>>> result
u'Rafa\u0142'

So you do the first decode, getting a Unicode string where each character is actually a UTF-8 byte value. You go via the integer value of each of those characters to get back to a genuine UTF-8 string, which you then decode as normal.
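The same trick rendered in Python 3 terms (a sketch, since the session above is Python 2): `bytes()` replaces the chr/ord join, because in 3.x the decoded string's codepoints must be packed back into a bytes object before the second decode.

```python
# Python 3 rendering of the same approach: the first decode yields a str
# whose codepoints are really UTF-8 byte values; pack them back into
# bytes, then decode again.
original = b"Rafa\xc3\x85\xc2\x82"
first_decode = original.decode("utf-8")          # 'Rafa\xc5\x82'
as_bytes = bytes(ord(c) for c in first_decode)   # b'Rafa\xc5\x82'
result = as_bytes.decode("utf-8")                # 'Rafa\u0142'
```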

RichieHindle
>>> weird = u'Rafa\xc5\x82'
>>> weird.encode('latin1').decode('utf8')
u'Rafa\u0142'
>>>

latin1 is just an abbreviation for Richie's nuts'n'bolts method.

It is very curious that the seriously under-described raw_unicode_escape codec gives the same result as latin1 in this case. Do they always give the same result? If so, why have such a codec? If not, it would be preferable to know for sure exactly how the OP's client did the transformation from 'Rafa\xc5\x82' to u'Rafa\xc5\x82' and then to reverse that process exactly -- otherwise we might come unstuck if different data crops up before the double encoding is fixed.

John Machin
  • When your string contains only codepoints 0-255, it's always the same. The difference is characters above that; raw_unicode_escape will escape them, e.g. \u1234, where latin1 will throw UnicodeEncodeError. (Decoding has the symmetric difference -- raw_unicode_escape decodes \u1234 escapes, latin1 does not, but it's only encoding here.) They're equivalent here, but I'd stick with latin1, since this has nothing to do with escaping and latin1 is a more widely understood encoding. – Glenn Maynard Jul 24 '09 at 18:58
  • Thanks Glenn, thinking about backslashes after midnight turned my brain into a pumpkin :-) – John Machin Jul 24 '09 at 22:52
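For what it's worth, the divergence between the two codecs is easy to check (here in Python 3): for a codepoint above 255, raw_unicode_escape falls back to a backslash escape, while latin1 raises.

```python
# Where latin1 and raw_unicode_escape diverge: codepoints above 255.
high = "\u20ac"  # Euro sign, U+20AC

escaped = high.encode("raw_unicode_escape")  # b'\\u20ac', no error
try:
    high.encode("latin-1")
    latin1_raised = False
except UnicodeEncodeError:
    latin1_raised = True                     # latin1 cannot represent it
```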