Why does json.dumps escape non-ascii characters with “\uxxxx”
Python 2 may mix ascii-only bytestrings and Unicode strings together.
It might be a premature optimization. Unicode strings may require 2-4 times more memory than corresponding bytestrings if they contain characters mostly in ASCII range in Python 2.
Also, even today, print(unicode_string) may easily fail if it contains non-ASCII characters while printing to the Windows console, unless something like the win-unicode-console Python package is installed. It may fail even on Unix if the C/POSIX locale is used (the default for init.d services, ssh, and cron in many cases), because that locale implies the ASCII character encoding. There is C.UTF-8, but it is not always available and you have to configure it explicitly. This might explain why you might want ensure_ascii=True in some cases.
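As a rough illustration of the point above, the following sketch (written for Python 3, where str is always Unicode) checks whether a string survives an ASCII-only encoding, which is what an ASCII-only console or locale effectively demands. The helper name fits_ascii is made up for this example.

```python
import json

def fits_ascii(s):
    """Return True if s can be encoded with the ASCII codec (hypothetical helper)."""
    try:
        s.encode("ascii")
        return True
    except UnicodeEncodeError:
        return False

text = "\u00f8"             # a single non-ASCII character
raw_ok = fits_ascii(text)   # the raw character cannot be encoded as ASCII

escaped = json.dumps(text)  # ensure_ascii=True is the default
escaped_ok = fits_ascii(escaped)  # the escaped JSON text is pure ASCII

print(raw_ok, escaped_ok, escaped)
```

The escaped form is safe to write to any ASCII-compatible stream, at the cost of being less readable.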
The JSON format is defined for Unicode text, and therefore, strictly speaking, json.dumps() should always return a Unicode string, but in Python 2 it may return a bytestring if all characters are in the ASCII range (xml.etree.ElementTree has a similar "optimization"). It is confusing that Python 2 allows treating an ASCII-only bytestring as a Unicode string in some cases (implicit conversions are allowed). Python 3 is stricter (implicit conversions are forbidden).
ASCII-only bytestrings might be used instead of Unicode strings (with possible non-ASCII characters) to save memory and/or improve interoperability in Python 2.
To disable that behavior, use json.dumps(obj, ensure_ascii=False).
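A minimal sketch of the difference, assuming Python 3 (the dictionary and its key are invented for the example): both settings produce JSON text that decodes to the same object, and only the textual representation differs.

```python
import json

obj = {"name": "\u00f8"}  # example data, not from the original question

escaped = json.dumps(obj)                          # default: ensure_ascii=True
unescaped = json.dumps(obj, ensure_ascii=False)    # keeps the character as-is

print(escaped)    # ASCII-only JSON text with a \u escape
print(unescaped)  # JSON text containing the non-ASCII character itself

# Both texts describe the same data.
round_trip_equal = json.loads(escaped) == json.loads(unescaped) == obj
```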
It is important to avoid confusing a Unicode string with its representation in Python source code as a Python string literal, or with its representation in a file as JSON text.
The JSON format allows escaping any character, not just characters outside the ASCII range:
>>> import json
>>> json.loads(r'"\u0061"')
u'a'
>>> json.loads('"a"')
u'a'
Don't confuse it with escapes in Python string literals used in Python source code. u"\u00f8" is a single Unicode character, but "\u00f8" in the output is eight characters (in Python source code, you could write it as r'"\u00f8"' == '"\\u00f8"' == u'"\\u00f8"'). The backslash is special in both Python literals and JSON text, so double escaping may happen. Also, there are no \x escapes in JSON:
>>> json.loads(r'"\x61"') # invalid JSON
Traceback (most recent call last):
...
ValueError: Invalid \escape: line 1 column 2 (char 1)
>>> r'"\x61"' # valid Python literal (6 characters)
'"\\x61"'
>>> '"\x61"' # valid Python literal with escape sequence (3 characters)
'"a"'
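To make the character counts concrete, here is a small sketch, assuming Python 3 (so the u prefix is unnecessary). It compares the length of the in-memory string with the length of the JSON text that represents it.

```python
import json

char = "\u00f8"               # Python literal: one character in memory
json_text = json.dumps(char)  # JSON text: quote, backslash, u, 0, 0, f, 8, quote

lengths = (len(char), len(json_text))

# Two ways to spell the same eight-character JSON text in Python source.
same_spelling = json_text == r'"\u00f8"' == '"\\u00f8"'

print(lengths, same_spelling)
```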
The output of json.dumps() is a str, which is a byte string in Python 2. And thus shouldn't it escape characters as \xhh ?
json.dumps(obj, ensure_ascii=True) produces only printable ASCII characters, and therefore the output of print repr(json.dumps(u"\xf8")) won't contain the \xhh escapes that repr() uses to represent non-printable characters (bytes).
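The same contrast can be sketched in Python 3, where bytes and text are separate types. This is only an analogy to the Python 2 behavior discussed here: repr() of the UTF-8 encoded bytes uses \xhh escapes, while repr() of the json.dumps() result does not, because that result already consists of printable ASCII characters.

```python
import json

utf8_bytes = "\u00f8".encode("utf-8")   # two non-ASCII bytes
bytes_repr = repr(utf8_bytes)           # repr() escapes them as \xhh

json_text = json.dumps("\u00f8")        # printable ASCII only
json_repr = repr(json_text)             # nothing for repr() to \x-escape

print(bytes_repr)
print(json_repr)
```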
\u escapes can be necessary even for ASCII-only input, because JSON requires control characters (U+0000 through U+001F) inside strings to be escaped:
#!/usr/bin/env python2
import json
print json.dumps(map(unichr, range(128)))
Output
["\u0000", "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007",
"\b", "\t", "\n", "\u000b", "\f", "\r", "\u000e", "\u000f", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019",
"\u001a", "\u001b", "\u001c", "\u001d", "\u001e", "\u001f", " ", "!", "\"", "#",
"$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3",
"4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S",
"T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c",
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
"t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\u007f"]
But isn't this quite confusing because \uxxxx is a unicode character and should be used inside a unicode string
\uxxxx is six characters that may be interpreted as a single character in some contexts. For example, in Python source code, u"\uxxxx" is a Python literal that creates a Unicode string in memory with a single Unicode character. But if you see \uxxxx in a JSON text, it is six characters that may represent a single Unicode character once you load it (json.loads()).
At this point, you should understand why len(json.loads('"\\\\"')) == 1.
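If it helps, the two layers of unescaping can be traced step by step. The sketch below assumes Python 3, and the variable names are chosen only for illustration.

```python
import json

# Layer 1: the Python literal. Each \\ in source becomes one backslash in memory.
json_text = '"\\\\"'
in_memory = list(json_text)   # quote, backslash, backslash, quote

# Layer 2: the JSON decoder. The two backslashes are one escaped backslash.
decoded = json.loads(json_text)

print(len(json_text), in_memory, len(decoded))

# Encoding the single backslash again reproduces the four-character JSON text.
re_encoded = json.dumps(decoded)
```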