3

I came from this old discussion, but the solution didn't help much as my original data was encoded differently:

My original data is already a unicode string, and I need to output it as UTF-8:

data={"content":u"\u4f60\u597d"}

When I try to convert it to UTF-8:

json.dumps(data, indent=1, ensure_ascii=False).encode("utf8")

the output I get is "content": "ä½ å¥½", but the expected output is "content": "你好".

I also tried without ensure_ascii=False, and the output becomes the plain escaped "content": "\u4f60\u597d".

How can I convert the \u-escaped JSON to UTF-8?

wjandrea
Bonk

2 Answers

9

You have UTF-8 JSON data:

>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> json.dumps(data, indent=1, ensure_ascii=False)
u'{\n "content": "\u4f60\u597d"\n}'
>>> json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'
>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8')
{
 "content": "你好"
}

My terminal just happens to be configured to handle UTF-8, so printing the UTF-8 bytes to my terminal produced the desired output.

However, if your terminal is not set up for such output, it is your terminal that then shows 'wrong' characters:

>>> print json.dumps(data, indent=1, ensure_ascii=False).encode('utf8').decode('latin1')
{
 "content": "ä½ å¥½"
}

Note how I decoded the data to Latin-1 to deliberately mis-read the UTF-8 bytes.

This isn't a Python problem; this is a problem with how you are handling the UTF-8 bytes in whatever tool you used to read these bytes.
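To make the mis-decoding concrete, here is a minimal Python 3 sketch (the answer above uses Python 2 syntax). It assumes the reader of the bytes guesses Latin-1; the variable names are only illustrative. It shows that the garbled text is the same UTF-8 bytes read with the wrong codec, and that the mistake is reversible:

```python
import json

data = {"content": "\u4f60\u597d"}

# The correct UTF-8 bytes produced on the Python side.
utf8_bytes = json.dumps(data, ensure_ascii=False).encode("utf-8")

# What a consumer sees if it wrongly assumes Latin-1 (assumed wrong codec).
mojibake = utf8_bytes.decode("latin-1")

# Undo the wrong guess: back to the original bytes, then decode as UTF-8.
recovered = mojibake.encode("latin-1").decode("utf-8")

print(repr(mojibake))
print(json.loads(recovered) == data)
```

Nothing is lost in the round trip because Latin-1 maps every byte value to exactly one character, so the original bytes can always be recovered.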

Sunny Patel
Martijn Pieters
  • Thank you, it was my browser that's acting up. I thought the `ä½ å¥½` was encoding error on Python end. Turns out it's the output :) – Bonk Jul 27 '16 at 18:39
  • 1
    @Bonk: perhaps you need to set a proper response header? `Content-Type: application/json` should be enough (as the JSON standard specifies that UTF is the default, with a BOM at the start making it possible to distinguish UTF-8 from UTF-16 and UTF-32), or include the charset explicitly with `Content-Type: application/json; charset=utf8`. Without a `Content-Type` header or with one set to a `text/..` mimetype the browser may well default to Latin-1. – Martijn Pieters Jul 27 '16 at 18:41
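As a rough illustration of the header suggestion in the comment above, here is a minimal WSGI sketch in Python 3. The `app` callable and the payload are hypothetical; the actual server setup in the question is unknown, so treat this as one possible shape rather than the asker's code:

```python
import json

def app(environ, start_response):
    # Hypothetical WSGI app: serve the JSON as UTF-8 bytes and say so in the header.
    body = json.dumps({"content": "\u4f60\u597d"}, ensure_ascii=False).encode("utf-8")
    headers = [
        ("Content-Type", "application/json; charset=utf-8"),
        ("Content-Length", str(len(body))),
    ]
    start_response("200 OK", headers)
    return [body]
```

For a quick local check, a callable like this could be served with the standard library's `wsgiref.simple_server.make_server`.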
4

In Python 2 this works. In Python 3, however, printing the encoded result shows the bytes literal instead:

b'{\n "content": "\xe4\xbd\xa0\xe5\xa5\xbd"\n}'

Either do not call encode('utf8') and print the str directly:

>>> print(json.dumps(data, indent=1, ensure_ascii=False))
{
 "content": "你好"
}

or use sys.stdout.buffer.write instead of print:

>>> import sys
>>> import json
>>> data = {'content': u'\u4f60\u597d'}
>>> sys.stdout.buffer.write(json.dumps(data, indent=1, ensure_ascii=False).encode('utf8') + b'\n')
{
 "content": "你好"
}

See also: Write UTF-8 to stdout, regardless of the console's encoding
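The byte-level approach can be wrapped in a small helper, sketched here for Python 3. The name `write_json_utf8` is made up for illustration; it accepts any binary stream, so the same code works for `sys.stdout.buffer`, a file opened in `'wb'` mode, or an in-memory buffer:

```python
import io
import json

def write_json_utf8(obj, stream):
    # Hypothetical helper: serialize without \u escapes, then write UTF-8 bytes.
    payload = json.dumps(obj, indent=1, ensure_ascii=False).encode("utf-8") + b"\n"
    stream.write(payload)
    return len(payload)

# Example with an in-memory binary stream; sys.stdout.buffer would work the same way.
buffer = io.BytesIO()
written = write_json_utf8({"content": "\u4f60\u597d"}, buffer)
print(written, buffer.getvalue())
```

Writing bytes this way bypasses the text layer of `sys.stdout`, so the output is UTF-8 whatever encoding the console reports.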