3

In Python 2, json.dumps() ensures that all non-ASCII characters are escaped as \uxxxx.


But isn't this quite confusing? \uxxxx is a Unicode escape and should be used inside a Unicode string.

The output of json.dumps() is a str, which is a byte string in Python 2. So shouldn't it escape characters as \xhh instead?

>>> unicode_string = u"\u00f8"
>>> print unicode_string
ø
>>> print json.dumps(unicode_string)
"\u00f8"
>>> unicode_string.encode("utf8")
'\xc3\xb8'
Kartik Anand

3 Answers

5

Why does json.dumps escape non-ascii characters with “\uxxxx”

Python 2 allows mixing ASCII-only bytestrings and Unicode strings.

It might be a premature optimization: in Python 2, Unicode strings may require 2-4 times more memory than the corresponding bytestrings if their characters are mostly in the ASCII range.

Also, even today, print(unicode_string) may easily fail when printing non-ASCII characters to the Windows console unless something like the win-unicode-console Python package is installed. It may fail even on Unix if the C/POSIX locale is used (the default for init.d services, ssh, and cron in many cases), because that locale implies the ASCII character encoding. There is C.UTF-8, but it is not always available and you have to configure it explicitly. That may explain why you might want ensure_ascii=True in some cases.

The JSON format is defined for Unicode text, so strictly speaking json.dumps() should always return a Unicode string, but it may return a bytestring if all characters are in the ASCII range (xml.etree.ElementTree has a similar "optimization"). It is confusing that Python 2 allows treating an ASCII-only bytestring as a Unicode string in some cases (implicit conversions are allowed). Python 3 is stricter (implicit conversions are forbidden).

ASCII-only bytestrings might be used instead of Unicode strings (with possible non-ASCII characters) to save memory and/or improve interoperability in Python 2.

To disable that behavior, use json.dumps(obj, ensure_ascii=False).
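A quick check of the flag's effect (Python 3 shown here; in Python 2 the escaped form is the same, except that ensure_ascii=True returns a str bytestring and ensure_ascii=False a unicode object):

```python
import json

# Default (ensure_ascii=True): non-ASCII characters are escaped,
# so the result contains only ASCII characters.
escaped = json.dumps(u"\u00f8")
print(escaped)  # "\u00f8"

# ensure_ascii=False keeps the character as-is; the result is a
# Unicode string and must be encoded (e.g. UTF-8) before writing
# to a binary channel.
raw = json.dumps(u"\u00f8", ensure_ascii=False)
print(raw)      # "ø"
```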


It is important to avoid confusing a Unicode string with its representation in Python source code as Python string literal or its representation in a file as JSON text.

The JSON format allows escaping any character, not just Unicode characters outside the ASCII range:

>>> import json
>>> json.loads(r'"\u0061"')
u'a'
>>> json.loads('"a"')
u'a'

Don't confuse it with escapes in Python string literals used in Python source code. u"\u00f8" is a single Unicode character, but "\u00f8" in the output is eight characters. In Python source code, you could write it as r'"\u00f8"' == '"\\u00f8"' == u'"\\u00f8"' (backslash is special both in Python literals and in JSON text, so double escaping may happen). Also, there are no \x escapes in JSON:

>>> json.loads(r'"\x61"') # invalid JSON
Traceback (most recent call last):
...
ValueError: Invalid \escape: line 1 column 2 (char 1)
>>> r'"\x61"' # valid Python literal (6 characters)
'"\\x61"'
>>> '"\x61"'  # valid Python literal with escape sequence (3 characters)
'"a"'

The output of json.dumps() is a str, which is a byte string in Python 2. So shouldn't it escape characters as \xhh instead?

json.dumps(obj, ensure_ascii=True) produces only printable ASCII characters, and therefore print repr(json.dumps(u"\xf8")) won't contain the \xhh escapes that repr() uses to represent non-printable characters (bytes).
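This is easy to verify (Python 3 shown; the Python 2 bytestring result consists of the same characters):

```python
import json
import string

out = json.dumps(u"\xf8")   # the escaped form '"\u00f8"'

# Every character of the result is printable ASCII, so repr() of
# the result never needs \xhh escapes.
assert out == '"\\u00f8"'
assert all(c in string.printable for c in out)
```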

\u escapes can be necessary even for ascii-only input:

#!/usr/bin/env python2
import json
print json.dumps(map(unichr, range(128)))

Output

["\u0000", "\u0001", "\u0002", "\u0003", "\u0004", "\u0005", "\u0006", "\u0007",
"\b", "\t", "\n", "\u000b", "\f", "\r", "\u000e", "\u000f", "\u0010", "\u0011",
"\u0012", "\u0013", "\u0014", "\u0015", "\u0016", "\u0017", "\u0018", "\u0019",
"\u001a", "\u001b", "\u001c", "\u001d", "\u001e", "\u001f", " ", "!", "\"", "#",
"$", "%", "&", "'", "(", ")", "*", "+", ",", "-", ".", "/", "0", "1", "2", "3",
"4", "5", "6", "7", "8", "9", ":", ";", "<", "=", ">", "?", "@", "A", "B", "C",
"D", "E", "F", "G", "H", "I", "J", "K", "L", "M", "N", "O", "P", "Q", "R", "S",
"T", "U", "V", "W", "X", "Y", "Z", "[", "\\", "]", "^", "_", "`", "a", "b", "c",
"d", "e", "f", "g", "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s",
"t", "u", "v", "w", "x", "y", "z", "{", "|", "}", "~", "\u007f"]

But isn't this quite confusing? \uxxxx is a Unicode escape and should be used inside a Unicode string

\uxxxx is six characters that may be interpreted as a single character in some contexts: in Python source code, u"\uxxxx" is a literal that creates a Unicode string in memory containing a single Unicode character. But if you see \uxxxx in JSON text, it is six characters that may represent a single Unicode character once you load it (json.loads()).

At this point, you should understand why len(json.loads('"\\\\"')) == 1.
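Working through that example explicitly:

```python
import json

# The Python literal '"\\\\"' is the four-character JSON text: "\\"
# (quote, backslash, backslash, quote). The doubled backslash is a
# JSON escape for a single backslash.
json_text = '"\\\\"'
assert len(json_text) == 4

# Loading it yields a one-character string: a single backslash.
assert json.loads(json_text) == '\\'
assert len(json.loads(json_text)) == 1
```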

jfs
  • »Also, even today, print(unicode_string) may easily fail if it contains non-ascii characters« is probably the main point here. I'd dismiss memory usage, as most people don't routinely generate, parse, or manipulate tens of Gibibytes of JSON. – Joey Sep 12 '15 at 18:30
  • @Joey: Other things being equal, an algorithm that uses less memory is faster as a rule. I've already said that it can be considered a *premature optimization*. Printing json to console is useful for debugging (and debugging, human-readability of the format are important) but it is not hard to use utf-8 in cases where json is typically used (data exchange between programs (perhaps on different computers)) and therefore the **default should be Unicode**. – jfs Sep 12 '15 at 21:47
2

In a Python 2 byte string, the \u in "\u00f8" isn't actually an escape sequence like \x; the \u is the literal two characters r'\u'. But such byte strings can easily be converted to Unicode.

Demo:

s = "\u00f8"
u = s.decode('unicode-escape')
print repr(s), len(s), repr(u), len(u)

s = "\u2122"
u = s.decode('unicode-escape')
print repr(s), len(s), repr(u), len(u)

output

'\\u00f8' 6 u'\xf8' 1
'\\u2122' 6 u'\u2122' 1

As J.F.Sebastian mentions in the comments, \u00f8 is a true escape code inside a Unicode string literal, i.e., in a Python 3 string or in a Python 2 u"\u00f8" literal. Also take heed of his other remarks!
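For reference, the demo above translated to Python 3, where str objects have no .decode() and 'unicode-escape' is a bytes-to-str codec, so the input must be a bytes literal:

```python
# Six bytes: backslash, u, 0, 0, f, 8 -- not an escape sequence yet.
s = b'\\u00f8'
assert len(s) == 6

# The 'unicode-escape' codec interprets the \u sequence,
# yielding a single character.
u = s.decode('unicode-escape')
assert u == '\xf8'
assert len(u) == 1
```

(As noted in the comments below, this codec is for Python-literal-style escapes; JSON text should still go through json.loads().)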

PM 2Ring
  • do not use `.decode('unicode-escape')` on JSON text, use `json.loads(json_text)` instead. – jfs Sep 12 '15 at 12:33
  • @J.F.Sebastian: I wasn't actually suggesting the use of .decode('unicode-escape') on JSON text, merely illustrating that `"\u00f8"` etc aren't escape sequences in the sense that `'\xf8'` and `'\n'` are. And showing how to handle such sequences in a non-JSON context. Also note that the OP is talking about JSON _output_; they don't mention anything in the question about handling JSON _input_. – PM 2Ring Sep 12 '15 at 12:59
  • `\u00f8` **is** an escape sequence inside both Python unicode literal and inside a JSON text. You should not use `.decode('unicode-escape')` to fix broken bytestring literal, you should use a Unicode literal (`u''`) instead. If `\uxxxx` arrives in a variable then either json format should be used or `ast.literal_eval()` if input is a Python unicode literal and it is impossible to fix the upstream data source. It is misleading to recommend `.decode('unicode-escape')` to a beginner. – jfs Sep 12 '15 at 14:16
  • @J.F.Sebastian: Fair enough. Is that better? – PM 2Ring Sep 12 '15 at 14:53
1

That's exactly the point. You get a byte string back, not a Unicode string. Thus the Unicode characters need to be escaped to survive. The escaping is allowed by JSON and thus presents a safe way of representing Unicode characters.
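The escaped form survives any ASCII-only channel and round-trips losslessly (Python 3 shown; the ensure_ascii=True escaping is the same in Python 2):

```python
import json

original = u"\u00f8"
dumped = json.dumps(original)

# Would raise UnicodeEncodeError if any non-ASCII byte were present.
dumped.encode('ascii')

# Loading the escaped text recovers the original character exactly.
assert json.loads(dumped) == original
```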

Joey
  • Yeah, I get the point about the byte string. But is the \u escape allowed in byte strings as well? Because I've only seen \x escapes used in byte strings. – Kartik Anand Sep 05 '15 at 10:27
  • 1
    It's not a *Python* escape. It's a *JSON* escape. – Joey Sep 05 '15 at 10:56
  • 4
    If you don't like it when your JSON file contains those `\uXXXX` sequences, you can use `print json.dumps(unicode_string, ensure_ascii=False)`, so it returns a `unicode` string rather than a byte string. I'd consider those escape sequences a legacy system for dealing with a system which mangles non-ASCII characters. – roeland Sep 06 '15 at 23:11
  • 1
    this explains why you have to escape non-ascii characters *if* the result is a bytestring *and* you want interoperability in the environment that does not accept non-ascii bytes. It doesn't explain why you can't use utf-8 instead of ascii. It doesn't explain why `json.dumps()` would return a bytestring in the first place. `json.dumps()` *can* and *should* return a Unicode string instead. [My guess: a bytestring is used to save memory in Python 2](http://stackoverflow.com/a/32539609/4279) – jfs Sep 12 '15 at 14:30