In parsing an HTML response to extract data with Python 3.4 on Kubuntu 15.10 in the Bash CLI, using print()
I am getting output that looks like this:
\u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
How would I output the actual text itself in my application?
This is the code generating the string:
response = requests.get(url)
messages = json.loads( extract_json(response.text) )
for k,v in messages.items():
for message in v['foo']['bar']:
print("\nFoobar: %s" % (message['body'],))
Here is the function which returns the JSON from the HTML page:
def extract_json(input_):
"""
Get the JSON out of a webpage.
The line of interest looks like this:
foobar = ["{\"name\":\"dotan\",\"age\":38}"]
"""
for line in input_.split('\n'):
if 'foobar' in line:
return line[line.find('"')+1:-2].replace(r'\"',r'"')
return None
In googling the issue, I've found quite a bit of information relating to Python 2, however Python 3 has completely changed how strings and especially Unicode are handled in Python.
How can I convert the example string (\u05ea
) to characters (ת
) in Python 3?
Addendum:
Here is some information regarding message['body']
:
print(type(message['body']))
# Prints: <class 'str'>
print(message['body'])
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(repr(message['body']))
# Prints: '\\u05ea\u05d4 \\u05e0\\u05e9\\u05de\\u05e2 \\u05de\\u05e6\\u05d5\\u05d9\\u05df'
print(message['body'].encode().decode())
# Prints: \u05ea\u05d4 \u05e0\u05e9\u05de\u05e2 \u05de\u05e6\u05d5\u05d9\u05df
print(message['body'].encode().decode('unicode-escape'))
# Prints: תה נשמע מצוין
Note that the last line does work as expected, but it has a few issues:
- Decoding string literals with unicode-escape is the wrong thing as Python escapes are different to JSON escapes for many characters. (Thank you bobince)
encode()
relies on the default encoding, which is a bad thing.(Thank you bobince)- The
encode()
fails on some newer Unicode characters, such as \ud83d\ude03, with UnicodeEncodeError "surrogates not allowed".