json.dumps \u escaped unicode to utf8

Question

I came from this old discussion, but the solution didn't help much as my original data was encoded differently:

My original data is already a unicode string, and I need to output it as UTF-8:

data={"content":u"\u4f60\u597d"}

When I try to convert it to UTF-8:

json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")

the output I get is "content": "ä½ å¥½" and the expected output should be "content": "你好"

I tried without ensure_ascii=False and the output becomes the \u-escaped "content": "\u4f60\u597d"
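
For reference, here is a minimal Python 3 sketch of the two behaviours described above. It assumes nothing beyond the standard `json` module; the variable names `escaped` and `literal` are only for illustration.

```python
import json

data = {"content": "\u4f60\u597d"}  # the same two characters as in the question

# ensure_ascii=True is the default: non-ASCII characters become \uXXXX escapes.
escaped = json.dumps(data)
# ensure_ascii=False keeps the characters themselves in the resulting str.
literal = json.dumps(data, ensure_ascii=False)

print(escaped)                  # {"content": "\u4f60\u597d"}
print(literal.encode("utf-8"))  # the UTF-8 bytes, shown as a bytes repr

# Both forms are valid JSON and decode back to the same object.
assert json.loads(escaped) == json.loads(literal) == data
```

Either form is correct JSON; the choice only affects how the characters are written out.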

How can I convert the previously \u escaped json to UTF-8?

You are reading your UTF-8 data in the wrong codec. You **have** UTF-8, but are decoding it as Latin-1 or CP1252. In other words, this is not a Python problem. — Martijn Pieters, Jul 27 '16 at 18:22
Yeah, I was unable to reproduce this problem in the Python 3 REPL. — David Grayson, Jul 27 '16 at 18:25

score 9 · Accepted Answer · edited Jul 27 '16 at 18:40

You have UTF-8 JSON data:

>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
 "content": "你好"
}

My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.

However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:

>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
 "content": "ä½ å¥½"
}

Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.

This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
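
The same mis-read can be sketched in Python 3 without involving a terminal at all. This is only an illustration of the mechanism, assuming the reading tool uses Latin-1; the names `mojibake` and `repaired` are made up for the example.

```python
text = "\u4f60\u597d"              # the two characters from the question
utf8_bytes = text.encode("utf-8")  # b'\xe4\xbd\xa0\xe5\xa5\xbd'

# Decoding UTF-8 bytes as Latin-1 never raises; it just yields the wrong characters,
# one character per byte (six here instead of two).
mojibake = utf8_bytes.decode("latin-1")

# Latin-1 maps every byte to exactly one code point, so the mistake is reversible.
repaired = mojibake.encode("latin-1").decode("utf-8")

print(len(text), len(mojibake))  # 2 6
print(repaired == text)          # True
```

The bytes are never damaged along the way; only the codec used to interpret them is wrong.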

Thank you, it was my browser that's acting up. I thought the `ä½ å¥½` was encoding error on Python end. Turns out it's the output :) — Bonk, Jul 27 '16 at 18:39
@Bonk: perhaps you need to set a proper response header? `Content-Type: application/json` should be enough (as the JSON standard specifies that UTF is the default, with a BOM at the start making it possible to distinguish UTF-8 from UTF-16 and UTF-32), or include the charset explicitly with `Content-Type: application/json; charset=utf8`. Without a `Content-Type` header or with one set to a `text/..` mimetype the browser may well default to Latin-1. — Martijn Pieters, Jul 27 '16 at 18:41
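
A rough Python 3 sketch of the header suggestion in the comment above. `build_json_response` is a hypothetical helper, not part of any framework; it only shows the encoded body and the headers that would go with it.

```python
import json


def build_json_response(payload):
    """Hypothetical helper: return (headers, body) for a JSON HTTP response."""
    body = json.dumps(payload, ensure_ascii=False).encode("utf-8")
    headers = {
        # Name the charset explicitly so the client does not have to guess.
        "Content-Type": "application/json; charset=utf-8",
        # Content-Length counts bytes, not characters.
        "Content-Length": str(len(body)),
    }
    return headers, body


headers, body = build_json_response({"content": "\u4f60\u597d"})
print(headers["Content-Type"])    # application/json; charset=utf-8
print(headers["Content-Length"])  # 21
```

How the headers are actually attached depends on the web framework in use, which the question does not name.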

score 4 · Answer 2 · answered May 28 '18 at 03:17

In Python 2 this works. In Python 3, however, print shows the repr of the bytes object instead:

b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'

Do not call encode('utf8'); print the str directly:

>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
 "content": "你好"
}

Or write the encoded bytes with sys.stdout.buffer.write instead of print:

>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1, ensure_ascii=False).encode('utf8') + b'\n')
{
 "content": "你好"
}

See "Write UTF-8 to stdout, regardless of the console's encoding".
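
If the JSON is going to a file rather than the console, a similar sketch applies: pass the encoding explicitly instead of relying on the locale default. The temporary path below is an assumption made only to keep the example self-contained.

```python
import json
import os
import tempfile

data = {"content": "\u4f60\u597d"}

# Hypothetical output path in a temporary directory, for illustration only.
path = os.path.join(tempfile.mkdtemp(), "out.json")

# encoding="utf-8" fixes the codec; newline="\n" avoids platform newline translation.
with open(path, "w", encoding="utf-8", newline="\n") as f:
    json.dump(data, f, indent=1, ensure_ascii=False)

# Read the file back as raw bytes to see exactly what was written.
with open(path, "rb") as f:
    raw = f.read()

print(raw)  # b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
```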
