
While parsing an HTML response to extract data with Python 3.4 (Kubuntu 15.10, Bash CLI), print() gives me output that looks like this:

\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

How would I output the actual text itself in my application?

This is the code generating the string:

import json
import requests

response = requests.get(url)
messages = json.loads(extract_json(response.text))

for k, v in messages.items():
    for message in v['foo']['bar']:
        print("\nFoobar: %s" % (message['body'],))

Here is the function which returns the JSON from the HTML page:

def extract_json(input_):

    """
    Get the JSON out of a webpage.
    The line of interest looks like this:
    foobar = ["{\"name\":\"dotan\",\"age\":38}"]
    """

    for line in input_.split('\n'):
        if 'foobar' in line:
            return line[line.find('"')+1:-2].replace(r'\"',r'"')

    return None

Googling the issue turns up quite a bit of information relating to Python 2, but Python 3 has completely changed how strings, and especially Unicode, are handled.

How can I convert the example string (\u05ea) to characters (ת) in Python 3?

Addendum:

Here is some information regarding message['body']:

print(type(message['body']))
# Prints: <class 'str'>

print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(repr(message['body']))
# Prints: '\\u05ea\\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'

print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df

print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין

Note that the last line does work as expected, but it has a few issues:

  • Decoding string literals with unicode-escape is the wrong thing to do, as Python escapes differ from JSON escapes for many characters. (Thank you bobince)
  • encode() relies on the default encoding, which is a bad thing. (Thank you bobince)
  • The encode() fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".
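
The problems listed above can be reproduced in isolation. This is a minimal sketch with made-up sample strings (not data from the actual page); it shows that unicode-escape reads existing non-ASCII text as Latin-1, and that it leaves a surrogate pair as two lone surrogates, whereas json.loads combines the pair into one character:

```python
import json

# 1. Real non-ASCII text is mangled: unicode-escape reads the bytes as Latin-1.
text = 'é \\u05ea'   # a real 'é' followed by a literal backslash-u escape
mangled = text.encode('utf-8').decode('unicode-escape')
print(ascii(mangled))   # the single 'é' has turned into two characters

# 2. An escaped surrogate pair is decoded as two lone surrogates.
escaped = '\\ud83d\\ude03'
halves = escaped.encode('ascii').decode('unicode-escape')
print(len(halves))
try:
    halves.encode('utf-8')   # this is what print() has to do on a UTF-8 terminal
except UnicodeEncodeError as exc:
    print('cannot encode:', exc.reason)

# json.loads treats the same text as a JSON string and joins the pair.
smiley = json.loads('"' + escaped + '"')
print(len(smiley), ascii(smiley))
```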
ShadowRanger
dotancohen
  • what is `print(ascii(message['body']))`? Unrelated: use `messages = response.json()`. – jfs Nov 02 '15 at 18:46
  • If the input is not JSON then what is it? `print(response.content[:50])`; `print(response.headers['Content-Type'])`. Can you change the upstream format returned by the service? – jfs Nov 02 '15 at 20:39
  • it is not what I've asked. Run the code from the comment as is. – jfs Nov 03 '15 at 15:56
  • @J.F.Sebastian: `b'\r\n\n\n – dotancohen Nov 03 '15 at 16:16
  • now we are getting somewhere. Could you post *the real* code that you use to get `messages`? (between `requests.get()` and `json.loads()` including) – jfs Nov 03 '15 at 16:23
  • @J.F.Sebastian: **I've removed the reduced test case and posted the actual code in use.** – dotancohen Nov 03 '15 at 17:11
  • what happens if you run `print(u"\u05ea")` in the same environment where `print("\nFoobar:..)` is executed? What is `sys.stdout.encoding`, `sys.stdout.errors`? What is `print(ascii(line[:10]))` before `.replace(r'\"',r'"')`? Drop `str()` around `line`. – jfs Nov 03 '15 at 17:43
  • print(u"\u05ea"): `ת` | print("\u05ea"): `ת` | sys.stdout.encoding: `UTF-8` | sys.stdout.errors: `strict` | ascii(line[:10]): `'\t\t\t\t\tfooba'` | Thank you! – dotancohen Nov 03 '15 at 17:52
  • increase `:10` until you see `\u` in the output. Are you sure you are running `print(u"\u05ea")` in the *same* environment as `print("\nFoobar:..)` (to make sure add `print(u"\u05ea")` just before `print("\nFoobar:..)` in the code)? – jfs Nov 03 '15 at 17:59
  • @J.F.Sebastian: Yes, it is the same environment. I just copied and pasted into the code file and ran it along with the other tests. Here is the output for `print(ascii(line[350:450]))`: `'d\\" : \\"6104187972690232298\\", \\"body\\" : \\"\\\\u05e9\\\\u05dc\\\\u05d5\\\\u05dd \\\\u05dc\\\\u05d9\\\\u05d6\\\\u05d'` – dotancohen Nov 03 '15 at 18:20
  • Let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/94128/discussion-between-dotancohen-and-j-f-sebastian). – dotancohen Nov 03 '15 at 19:43

1 Answer


It appears your input uses the backslash as an escape character, so you should unescape the text before passing it to json:

>>> foobar = '{\\"body\\": \\"\\\\u05e9\\"}'
>>> import re
>>> json_text = re.sub(r'\\(.)', r'\1', foobar) # unescape
>>> import json
>>> print(json.loads(json_text)['body'])
ש
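
Applied to the function from the question, the same unescape step might look like the sketch below. It assumes the line of interest always ends in `"]` (as the original slice does); the `page` string is a hypothetical stand-in for the real response, modelled on the docstring in the question:

```python
import json
import re


def extract_json(input_):
    """Return the JSON text embedded in the 'foobar' line, or None."""
    for line in input_.split('\n'):
        if 'foobar' in line:
            escaped = line[line.find('"') + 1:-2]
            # Remove one level of backslash escaping: \" -> " and \\ -> \
            return re.sub(r'\\(.)', r'\1', escaped)
    return None


# Hypothetical page content, not the real response.
page = (
    'header\n'
    '\t\tfoobar = ["{\\"body\\": \\"\\\\u05e9\\\\u05dc\\\\u05d5\\\\u05dd\\"}"]\n'
    'footer'
)

message = json.loads(extract_json(page))
print(message['body'])
```

The only change from the question's version is that every backslash escape is undone, not just `\"`, so `\\u05e9` reaches json.loads as `\u05e9` and is decoded there.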

Don't use 'unicode-escape' encoding on JSON text; it may produce different results:

>>> import json
>>> json_text = '["\\ud83d\\ude02"]'
>>> json.loads(json_text)
['😂']
>>> json_text.encode('ascii', 'strict').decode('unicode-escape') #XXX don't do it
'["\ud83d\ude02"]'

'😂' == '\U0001F602' is U+1F602 (FACE WITH TEARS OF JOY).

jfs
  • Thank you very, very much! Your patience in getting down to the root of the problem in the comments is very much appreciated. – dotancohen Nov 03 '15 at 19:42
  • @dotancohen I'm not sure whether I understand the actual answer to this question. To convert a unicode sequence to its string representation, we have to use JSON? Is that it? No encode/decode niftiness to solve this? – Bram Vanroy Sep 20 '18 at 09:37
  • Additionally, is this under-the-hood any different from `s.encode('utf-8').decode('unicode-escape')`? – Bram Vanroy Sep 20 '18 at 09:48
  • It's hard to explain in one comment what I'm aiming at, so please see [my separate question](https://stackoverflow.com/questions/52425315/converting-unicode-sequence-to-string-in-python3-but-allow-paths-in-string). – Bram Vanroy Sep 20 '18 at 18:26
  • @BramVanroy [reposted comment to fix typos] no. If you have a plain Unicode text already then you don't need to do anything with it. If you have a Unicode text in the JSON format then just use `result = json.loads(json_text)`. If you have a garbled input then try to fix it upstream; if you can't, use whatever is necessary to fix your particular broken input. Please, do note: `'\u2603'` and `r'\u2603'` are completely different things in Python (your question suggests that you do not see the difference). – jfs Sep 20 '18 at 18:37
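
The distinction drawn in the last comment can be checked directly. A small sketch (the variable names are illustrative only):

```python
import json

snowman = '\u2603'   # escape handled by the Python parser: one character
raw = r'\u2603'      # raw literal: a backslash, 'u', and four hex digits

print(len(snowman), len(raw))
print(snowman == raw)

# The six-character form is what JSON text contains; json.loads decodes it.
decoded = json.loads('"' + raw + '"')
print(decoded == snowman)
```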