I have some problems with strangely escaped Unicode strings. My script consumes a web service via the requests library, and response.text contains the following Unicode string:

 u'\\u003c? abc ?\\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von \xd6kosystemen abgeleitet.\\u003c? /abc ?\\u003e'

 **Updated**: Martijn's solution works with the example above, but breaks with this one because of len="12":
 u'\\u003c?abc len="12"?\\u003eResilienz sollte als st\xe4ndiger Anpassungsprozess zwischen Systemen und der Umwelt begriffen werden.\\u003c? /abc ?\\u003e'

The response from the server looks something like this:

\u003c? abc ?\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von Ökosystemen abgeleitet.\u003c?dpf /sent ?\u003e

The problem is the double-escaped Unicode sequences such as \u003c, which normally represents a < character. \xd6 is correct and represents the German Ö. This double escaping totally messes up my Unicode string :-)

I have found a similar problem at this post: Stack Overflow - Conversion of strings like \uXXXX in python

The solution given there, string.decode('unicode-escape'), only seems to work if all Unicode sequences are escaped, not with a mix of single and double escapes. Just replacing the double escapes with single ones gives me a corrupt Unicode string.

The easiest and best solution would be to adjust the response encoding on the server side, but I have no access ...

Thanks for your help!

hetsch
  • Out of curiosity, what's the content type header of those responses? – Martijn Pieters Dec 17 '12 at 20:14
  • Content-Type: text/plain; charset=UTF-8 – hetsch Dec 17 '12 at 20:28
  • @Martijn Pieters I tried the string in the Firebug console, and the output seems to be correct: `var a = '\u003c? abc len="12" ?\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von Ökosystemen abgeleitet.\u003c?dpf /sent ?\u003e'; console.log(a);` Strange things... – hetsch Dec 17 '12 at 20:42
  • Your second example, containing the extra quotes, is invalid as a literal JSON value. See my updated answer. – Martijn Pieters Dec 17 '12 at 20:45

1 Answer

I suspect the server is returning JSON strings. JSON uses the same escape sequence, and if you add quotes around the string json.loads() is perfectly happy to decode that example for you:

>>> txt = u'\\u003c? abc ?\\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von \xd6kosystemen abgeleitet.\\u003c? /abc ?\\u003e'
>>> content = txt.encode('utf8')
>>> content
'\\u003c? abc ?\\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von \xc3\x96kosystemen abgeleitet.\\u003c? /abc ?\\u003e'
>>> import json
>>> json.loads('"{0}"'.format(content))
u'<? abc ?>Das Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von \xd6kosystemen abgeleitet.<? /abc ?>'
>>> print json.loads('"{0}"'.format(content))
<? abc ?>Das Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von Ökosystemen abgeleitet.<? /abc ?>

Try using json.loads('"{0}"'.format(response.content)) to decode the response to Unicode.
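For readers on Python 3, the same idea can be sketched as follows. This is only an illustration under assumptions: `decode_json_fragment` is a hypothetical helper name, the sample strings are shortened versions of the ones in the question, and the input is assumed to be a `str` (for example `response.text`) rather than bytes.

```python
import json


def decode_json_fragment(fragment):
    """Decode a bare JSON string body by wrapping it in double quotes.

    Hypothetical helper; it assumes the fragment holds no unescaped
    double quotes, otherwise json.loads() raises ValueError.
    """
    return json.loads('"{0}"'.format(fragment))


# Shortened sample: literal backslash-u escapes mixed with a real umlaut.
raw = '\\u003c? abc ?\\u003eDynamik von \xd6kosystemen\\u003c? /abc ?\\u003e'
decoded = decode_json_fragment(raw)
print(decoded)

# A fragment with a bare double quote is not a valid JSON string body.
quoted = '\\u003c?abc len="12"?\\u003eResilienz\\u003c? /abc ?\\u003e'
try:
    decode_json_fragment(quoted)
    failed = False
except ValueError:
    failed = True
print(failed)
```

The second call fails because the first bare quote ends the JSON string early, which matches the "Extra data" error reported in the comments.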

Your updated version does contain quotes, which is a little vexing, since those would have to be escaped to be used in valid JSON. It is probably not JSON then, but some other form of escaping; Java and Ruby also use \uxxxx escapes. The next thing we can try is a regular expression to replace these:

import re

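# Match \uXXXX only when it is not preceded by another backslash.
uescapes = re.compile(r'(?<!\\)\\u[0-9a-fA-F]{4}', re.UNICODE)

def uescape_decode(match):
    # Decode a single matched escape into its Unicode character (Python 2).
    return match.group().decode('unicode_escape')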

uescapes.sub(uescape_decode, response.text)

This regular expression decodes any \uxxxx match to its Unicode character equivalent, provided that it is not preceded by a \, which effectively escapes the escape; \\uxxxx is not going to be replaced.

The regular expression approach decodes both of your examples (the second one first):

>>> print uescapes.sub(uescape_decode, txt)
<?abc len="12"?>Resilienz sollte als ständiger Anpassungsprozess zwischen Systemen und der Umwelt begriffen werden.<? /abc ?>
>>> print uescapes.sub(uescape_decode, u'\\u003c? abc ?\\u003eDas Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von \xd6kosystemen abgeleitet.\\u003c? /abc ?\\u003e')
<? abc ?>Das Modell des Adaptiven Zyklus wurde aus vergleichenden Untersuchungen zur Dynamik von Ökosystemen abgeleitet.<? /abc ?>
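
On Python 3, `str` has no `.decode()` method, so the callback above would need a different body. A minimal sketch of one possible port, assuming Python 3 and using the hypothetical names `UESCAPE` and `decode_uescapes`, converts the four hex digits directly with `chr()`:

```python
import re

# Same pattern as above, with the four hex digits captured in group 1.
UESCAPE = re.compile(r'(?<!\\)\\u([0-9a-fA-F]{4})')


def decode_uescapes(text):
    """Replace each unescaped \\uXXXX sequence with its character."""
    return UESCAPE.sub(lambda m: chr(int(m.group(1), 16)), text)


# Shortened version of the second example from the question.
sample = ('\\u003c?abc len="12"?\\u003eResilienz sollte als '
          'st\xe4ndiger Anpassungsprozess\\u003c? /abc ?\\u003e')
result = decode_uescapes(sample)
print(result)

# An escape preceded by a backslash is left untouched.
escaped = decode_uescapes('\\\\u003c')
print(escaped)
```

Characters that are already decoded, such as the ä, pass through unchanged because only the matched escapes are rewritten.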
Martijn Pieters
  • Thanks! That's a good hint! But I still have trouble with json.loads: ValueError: Extra data: line 1 column 57 - line 1 column 283 (char 57 I think I have to clean up some stuff. I think your answer solves my problem. If my stuff works out, I'm happy to accept it. Thank you so far. – hetsch Dec 17 '12 at 16:56
  • @martin-pieters, your answer is simply correct. The problem is that I provided a stripped-down, simple version of the Unicode example. With the simple version your solution works great, but it breaks with the updated one. – hetsch Dec 17 '12 at 19:44
  • @hetsch: Updated with a regex approach. – Martijn Pieters Dec 17 '12 at 20:39
  • @Clearquestionwithexamples: it happens to work in Python 2.7, provided you have a UCS-2 narrow build; in that case `uescapes.sub(uescape_decode, '\ud861\ude00')` produces `u'\U00028600'`. – Martijn Pieters Nov 11 '15 at 17:59
  • @Clearquestionwithexamples: on a UCS-4 build you end up with the surrogate pair as separate code units: `u'\ud861\ude00'`; encoding to UTF-16 then decoding again resolves those into their proper non-BMP codepoint. – Martijn Pieters Nov 11 '15 at 18:01
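
The UTF-16 round trip mentioned in the last comment can be sketched for Python 3, where decoding escapes one at a time always leaves the two surrogates as separate code points. This is an illustration only: `join_surrogates` is a hypothetical helper name, and the `surrogatepass` error handler is used on the assumption that the text holds only well-formed surrogate pairs.

```python
def join_surrogates(text):
    """Merge surrogate pairs into single non-BMP code points.

    Hypothetical helper; 'surrogatepass' lets the surrogates be encoded
    to UTF-16, and decoding then combines each pair into one character.
    """
    return text.encode('utf-16', 'surrogatepass').decode('utf-16')


pair = '\ud861\ude00'          # two separate surrogate code points
joined = join_surrogates(pair)
print(len(pair), len(joined))
```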