
The Unicode standard describes how characters are represented by code points and contains a lot of tables listing characters and their corresponding code points:

0061    'a'; LATIN SMALL LETTER A
0062    'b'; LATIN SMALL LETTER B

From https://docs.python.org/2/howto/unicode.html#definitions

In Python, the same string can be shown in two different forms, and the two forms compare equal:

u'中文' == u'\u4e2d\u6587'
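A minimal sketch of that equality, assuming Python 3 (where the `u` prefix is accepted but optional). The variable names are illustrative only:

```python
# -*- coding: utf-8 -*-
escaped = u'\u4e2d\u6587'   # written with \uXXXX escape sequences
literal = u'中文'            # written with the characters themselves

# Both literals produce the same two code points.
code_points = [hex(ord(ch)) for ch in escaped]
print(code_points)          # ['0x4e2d', '0x6587']
print(escaped == literal)   # True
```

Only the source-code spelling differs; once the literal is compiled there is a single string value.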

Obviously, people would rather read u'中文' than u'\u4e2d\u6587'. But in some situations in Python 2, Unicode strings are displayed only as escaped code points:

>>> print(u'\u4e2d\u6587')
中文
>>> print({u'\u4e2d\u6587': 1})
{u'\u4e2d\u6587': 1}
>>> print([u'\u4e2d\u6587', 1])
[u'\u4e2d\u6587', 1]

But there is no such problem in Python 3:

>>> print({u'\u4e2d\u6587': 1})
{'中文': 1}
>>> print([u'\u4e2d\u6587', 1])
['中文', 1]
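The difference can be reproduced inside Python 3 itself. The sketch below assumes Python 3 and uses the built-in `ascii()`, which escapes non-ASCII characters much as Python 2's `repr()` did (Python 2 additionally showed the `u` prefix):

```python
data = [u'\u4e2d\u6587', 1]

# Python 3: repr() keeps printable non-ASCII characters as they are.
readable = repr(data)

# ascii() escapes everything outside the ASCII range.
escaped = ascii(data)

print(escaped)                  # ['\u4e2d\u6587', 1]
print(readable == str(data))    # True: str() of a list falls back to repr()
```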

Here are my questions:

  • Can I tell Python which display form of Unicode strings I want?
  • Why is there no problem in Python 3?
  • Is there a simple solution for Python 2?

I haven't found a good solution in the following links:

Cloud
  • Don't confuse display forms with **escape notations to define a string value**. – Martijn Pieters Apr 03 '18 at 07:53
  • Basically, *don't print string representations*, print the strings themselves. That includes not printing a container object, not when presenting to an end-user. – Martijn Pieters Apr 03 '18 at 07:55
  • @MartijnPieters Can you explain `escape notations` in detail? I can't find any results when searching for `escape notations` on Google. – Cloud Apr 03 '18 at 07:55
  • Last but not least, in Python 3 there are still strings for which the representation will show `\uxxxx` escape sequences because otherwise those values would not be printable. – Martijn Pieters Apr 03 '18 at 07:56
  • https://en.wikipedia.org/wiki/Escape_character#Programming_and_data_formats and https://en.wikipedia.org/wiki/Escape_sequences_in_C and https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals – Martijn Pieters Apr 03 '18 at 07:56
  • *But there is no problem in Python3*: try `{'Hello\u2028World\u0007'}` for an example that still uses string literal escape syntax. – Martijn Pieters Apr 03 '18 at 08:11
  • @MartijnPieters Is `u'\u4e2d\u6587'` the [escape sequences](https://en.wikipedia.org/wiki/Escape_sequences_in_C) and `u'中文' ` the escaped sequences? – Cloud Apr 03 '18 at 08:32
  • Both define the same Unicode value, the first using escape sequences, the second using bytes that Python then decodes back to the Unicode codepoints, transparently, when compiling. The latter requires that Python knows what codec was used; in Python 2 source code that's ASCII by default, and UTF-8 for Python 3, but you can [set a different codec at the top of a source file](https://www.python.org/dev/peps/pep-0263/). In the terminal or console, Python picks it up from the configured locale. – Martijn Pieters Apr 03 '18 at 08:39
  • @MartijnPieters Python3 will escape unicode in `str()` but Python2 won't. In Python3 `str(['\u4e2d\u6587', 1])` will return `"['中文', 1]"` which is not equal with Python2: `assert str(['\u4e2d\u6587', 1]) != "['中文', 1]"` – Cloud Apr 03 '18 at 09:30
  • I already gave you an example that shows otherwise. – Martijn Pieters Apr 03 '18 at 09:31
  • To add to this: Python containers do not have a `str()` conversion; they only support `repr()` (`str()` falls back to `repr()` as needed), so only debugging output is supported. The contents of containers are always shown using their `repr()` result. These results are meant to help the developer reproduce the values, where possible. Since Python 2 source code uses ASCII by default, string representations are given in an ASCII-safe form (using escapes for anything not printable or outside the ASCII range). Python 3 uses UTF-8 for source code, so there the output is less escape-heavy. – Martijn Pieters Apr 03 '18 at 10:34
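
The advice in the comments above (print the strings themselves rather than a container's `repr()`) can be sketched as follows. This is only one possible approach, not a method given in the thread: `show` is a hypothetical helper name, and it relies on the standard library's `json.dumps` with `ensure_ascii=False`. Note that the result is JSON text (double quotes), not Python literal syntax.

```python
# -*- coding: utf-8 -*-
import json

def show(value):
    """Hypothetical helper: render a container for display, not debugging."""
    return json.dumps(value, ensure_ascii=False)

items = [u'\u4e2d\u6587', 1]

# JSON text with the characters kept as-is instead of \uXXXX escapes.
display = show(items)

# Alternatively, format each element yourself instead of printing the list.
joined = u', '.join(u'%s' % item for item in items)

# Even in Python 3, repr() still escapes characters that are not printable.
tricky = repr({u'Hello\u2028World\u0007'})
print(tricky)   # {'Hello\u2028World\x07'}
```

The same two approaches should also work in Python 2 when the elements are `unicode` objects, though that is an assumption that has not been verified here.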
