Why are some characters converted to \u notation when displayed by pprint when using utf8?

Question

Here is a console demonstration:

>>> x = "a b"
>>> x
'a\u200ab'
>>> repr( x )
"'a\\u200ab'"

So it seems pprint uses the same mechanism the interactive console uses to display strings.

Admittedly the white space character between a & b in the initial value bound to x is, indeed U+200a. But when using UTF-8 input and output encodings, why would any characters be converted to \u notation for output?

Question 2, of course, is how can one learn the whole set of characters that are converted in that manner?

Question 3, of course, is how can one suppress that behavior?

PM 2Ring · Answer 1 · 2017-01-01T06:58:17.310

3

pprint prints the representation of the object you pass it. From the docs:

The pprint module provides a capability to “pretty-print” arbitrary Python data structures in a form which can be used as input to the interpreter.

And "a form which can be used as input to the interpreter" means you get the object's representation, i.e., what its __repr__ method returns.

If you want strings to be printed using their __str__ method instead of their __repr__ then don't use pprint.
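
As a minimal sketch of that difference (the variable name `x` follows the question, and the hair space is written as an explicit escape so it survives copy/paste), compare what `print`, `repr` and `pprint` produce for the same string:

```python
import pprint

# U+200A HAIR SPACE between 'a' and 'b', written as an escape for clarity.
x = "a\u200ab"

# print() uses str(x): the hair space is written to the terminal unchanged.
print(x)

# repr() escapes the non-printable separator.
print(repr(x))            # 'a\u200ab'

# pprint formats a short string with its repr, so the escape shows up here too.
print(pprint.pformat(x))  # 'a\u200ab'
```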

Here's a Python 3 code snippet that looks for chars that get represented using a \u escape code:

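# Scan the first 1500 code points and report those whose repr uses a \u escape.
for i in range(1500):
    c = chr(i)
    r = repr(c)
    if r'\u' in r:
        # Columns: decimal code point, hex code point, repr, raw character
        print('{0:4} {0:04x} {1} {2}'.format(i, r, c))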

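output (all of these were unassigned code points at the time; the exact list depends on the Unicode version bundled with your Python)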

 888 0378 '\u0378' ͸
 889 0379 '\u0379' ͹
 896 0380 '\u0380' ΀
 897 0381 '\u0381' ΁
 898 0382 '\u0382' ΂
 899 0383 '\u0383' ΃
 907 038b '\u038b' ΋
 909 038d '\u038d' ΍
 930 03a2 '\u03a2' ΢
1328 0530 '\u0530' ԰
1367 0557 '\u0557' ՗
1368 0558 '\u0558' ՘
1376 0560 '\u0560' ՠ
1416 0588 '\u0588' ֈ
1419 058b '\u058b' ֋
1420 058c '\u058c' ֌
1424 0590 '\u0590' ֐
1480 05c8 '\u05c8' ׈
1481 05c9 '\u05c9' ׉
1482 05ca '\u05ca' ׊
1483 05cb '\u05cb' ׋
1484 05cc '\u05cc' ׌
1485 05cd '\u05cd' ׍
1486 05ce '\u05ce' ׎
1487 05cf '\u05cf' ׏

Note that codepoints > 0xffff get represented using a \U escape code, when necessary.

for i in range(65535, 65600):
    c = chr(i)
    r = repr(c)
    if r'\u' in r.lower():
        print('{0:4} {0:04x} {1} {2}'.format(i, r, c))

output

65535 ffff '\uffff' �
65548 1000c '\U0001000c' 
65575 10027 '\U00010027' 
65595 1003b '\U0001003b' 
65598 1003e '\U0001003e'

edited Jan 01 '17 at 06:58

answered Jan 01 '17 at 06:12

Clever to write code to look for them, which could answer question 2 nicely, but not the others. Perhaps analysis of the complete list would give clues. Certainly if the codes are not defined as characters by Unicode, one would expect \u notation, but for defined characters, I was surprised. – Victoria Jan 02 '17 at 00:08
@Victoria You shouldn't be surprised that \x, \u and \U notation is used in the repr of a string. The repr of an object needs to be robust & unambiguous. It's designed to be used by programmers, eg in source code & directly in the interpreter. It's _not_ supposed to be displayed to the user: they should only see properly formatted output created using the `__str__` method of the string, eg what `print(my_string)` displays. – PM 2Ring Jan 02 '17 at 04:19
@Victoria (cont) For further info on this important topic please take a look at the discussions of `__str__` vs `__repr__` in the docs, including the tutorial. Also see [here](http://stackoverflow.com/questions/1436703/difference-between-str-and-repr-in-python) and the relevant linked pages. – PM 2Ring Jan 02 '17 at 04:21
Your link, and some further searching based on keywords found in them, finally led me to the answers. Thanks for contributing. – Victoria Jan 02 '17 at 05:31
@Victoria You may find the [unicodedata](https://docs.python.org/3/library/unicodedata.html) module of interest. In particular, the `category` function is useful for determining what a particular char is for, and the `name` function returns the char's official name. Note that you can use Unicode names in strings, using a `\N{name}` escape sequence. – PM 2Ring Jan 02 '17 at 11:46
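
A short sketch of the `unicodedata` functions mentioned in the comment above, applied to the character from the question (the variable names are just for illustration):

```python
import unicodedata

ch = "\u200a"

# General category: 'Zs' means "Separator, space".
print(unicodedata.category(ch))  # Zs

# Official Unicode name of the character.
print(unicodedata.name(ch))      # HAIR SPACE

# The same character written with a \N{name} escape.
same = "\N{HAIR SPACE}"
print(same == ch)                # True
```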

score 1 · Accepted Answer · edited Mar 28 '22 at 08:04

I finally found the documentation that explains it. From the Python C API documentation on Unicode objects:

int Py_UNICODE_ISPRINTABLE(Py_UNICODE ch)

Return 1 or 0 depending on whether ch is a printable character. Nonprintable characters are those characters defined in the Unicode character database as “Other” or “Separator”, excepting the ASCII space (0x20) which is considered printable. (Note that printable characters in this context are those which should not be escaped when repr() is invoked on a string. It has no bearing on the handling of strings written to sys.stdout or sys.stderr.)

It partly answers the first question (the fact, not the reason why), and leads to the exact answer for Question 2.
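
One way to get at the answer for Question 2 from Python itself is `str.isprintable()`, which is documented with the same "Other"/"Separator" rule as the C macro. The following is only a sketch: it scans every code point and tallies the non-printable ones by general category (the name `counts` is illustrative, and the totals depend on the Unicode version of your Python build).

```python
import unicodedata
from collections import Counter

# Tally non-printable code points by Unicode general category.
counts = Counter()
for cp in range(0x110000):
    ch = chr(cp)
    if not ch.isprintable():
        counts[unicodedata.category(ch)] += 1

# Every category listed should start with 'C' (Other) or 'Z' (Separator).
for cat, n in sorted(counts.items()):
    print(cat, n)

# The ASCII space is the documented exception; the hair space is not.
print(" ".isprintable(), "\u200a".isprintable())  # True False
```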

Unicode space separator characters

I suppose the desire to be visually unambiguous is the reason: all those separator characters look "the same" (white space). That might be important if you are examining a paper log. When examining text online, though, copy/pasting it into a hex display tool or an online Unicode decoder is certainly sufficient, and it does not interrupt the flow of the text when the detail of which separator was used is not important (which, in my opinion, is most of the non-paper time).

Question 3 can apparently be done in one of two ways: creating a subclass of str with a different __repr__ (which disrupts existing code), or creating a subclass of pprint.PrettyPrinter whose format method avoids calling repr for str and just includes the value directly.
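
A rough sketch of the second approach, assuming a hypothetical subclass name `PlainStrPrinter`. It overrides the documented `PrettyPrinter.format` hook and quotes strings by hand instead of calling `repr`. Two caveats: the result is no longer guaranteed to be valid Python input (quotes and backslashes inside the string are not escaped), and on Python versions before 3.10 strings nested inside a container that fits on one line may still be formatted with `repr`.

```python
import pprint

class PlainStrPrinter(pprint.PrettyPrinter):
    """Hypothetical printer that shows str values without repr escapes."""

    def format(self, obj, context, maxlevels, level):
        if isinstance(obj, str):
            # Quote by hand; nothing inside the string is escaped.
            text = "'" + obj + "'"
            # Only claim the result is readable when it matches the real repr.
            readable = text == repr(obj)
            return text, readable, False
        # Everything else keeps the standard behavior.
        return super().format(obj, context, maxlevels, level)

printer = PlainStrPrinter()
plain = printer.pformat("a\u200ab")     # hair space kept as a literal character
escaped = pprint.pformat("a\u200ab")    # standard pprint escapes it
print(plain)
print(escaped)
```

Whether this is a good idea depends on who reads the output: the escaped form is the one that can be pasted back into the interpreter.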
