20

Why does Python 3 have different byte-oriented string representations? Wouldn't a single representation be enough instead of several?

For a number just above the ASCII range, the string representation shows an escape sequence starting with \x:

In [56]: chr(128)
Out[56]: '\x80'

In a different range of numbers, Python uses a sequence starting with \u:

In [57]: chr(57344)
Out[57]: '\ue000'

But for numbers in the highest range, e.g. the maximum Unicode code point as of now, it uses a leading \U:

In [58]: chr(1114111)
Out[58]: '\U0010ffff'
Martijn Pieters
MaNKuR

1 Answer

22

Python gives you a representation of the string, and for non-printable characters will use the shortest available escape sequence.

\x80 is the same character as \u0080 or \U00000080, but \x80 is just shorter. For chr(57344) the shortest notation is \ue000; you can't express the same character with \xhh, because that notation can only be used for characters up to \xFF.
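As a quick check of the equivalence described above, here is a minimal sketch (the variable names are only illustrative) that compares the three spellings and prints the representation Python picks for each of the code points from the question:

```python
# All three spellings denote the same one-character string (U+0080).
same = "\x80" == "\u0080" == "\U00000080"
print(same)

# repr() uses the shortest escape that can express each code point.
reprs = {cp: repr(chr(cp)) for cp in (0x80, 0xE000, 0x10FFFF)}
for code_point, text in reprs.items():
    print(hex(code_point), text)
```

The loop should reproduce the three outputs shown in the question.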

For some characters there are even single-letter escapes, like \n for a newline, or \t for a tab.

Python has multiple notation options for historical and practical reasons. In a byte string you can only create bytes in the range 0 - 255, so \xhh is helpful there: it is more concise than having to use \U000hhhhh everywhere when you couldn't even use the full range available to that notation. In addition, \xhh, \n and related codes are familiar to programmers from other languages.
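To illustrate the byte string case, the following sketch (names are hypothetical) builds a bytes object from integer values and shows that its representation uses single-letter escapes and \xhh, and that a value above 255 is rejected:

```python
# Bytes hold values 0-255 only, so two hex digits always suffice.
data = bytes([0x80, 0x0A, 0x09, 0x41])
data_repr = repr(data)
print(data_repr)  # single-letter escapes for newline/tab, \xhh for 0x80

# A value outside the byte range cannot be stored at all.
try:
    bytes([256])
    out_of_range_accepted = True
except ValueError as exc:
    out_of_range_accepted = False
    print("rejected:", exc)
```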

Martijn Pieters
  • Doesn't the same logic apply to `'\U0010ffff'`? Shouldn't it instead be `'\U10ffff'` or `'\u10ffff'`? – MaNKuR Sep 09 '17 at 17:21
  • 1
    @MaNKuR: no, because the `\U` syntax is fixed width. It takes 8 hex characters, and the `\u` syntax takes 4. If they took a variable number of hex characters, you couldn't follow them with other ASCII letters or digits that just happen to have hexadecimal meaning but are not part of the escape sequence. – Martijn Pieters Sep 09 '17 at 17:25
  • 2
    @MaNKuR: `\U` is 8 hex characters because the Unicode standard could conceivably expand to need all those digits. Just because the maximum codepoint is `\U0010FFFF` today doesn't mean that a future update to the Unicode standard won't ever reach `\UFFFFFFFF`. – Martijn Pieters Sep 09 '17 at 17:27
  • 1
    I'm still confused: `\u00a3` and `\xa3` are the same for the symbol `£`, but `\ua3` won't work? – mingchau Aug 22 '19 at 08:28
  • 4
    @mingchau: `\ua3` can't work because that's not a valid `\uhhhh` escape sequence; Python simply doesn't accept shorter forms. Accepting shorter escapes would be really confusing: does the text `'Hello \ua3darling'` contain the escape sequence `\ua`, `\ua3`, `\ua3d` or `\ua3da`? – Martijn Pieters Aug 22 '19 at 11:40
  • @MartijnPieters, is this information about differences documented somewhere in official Python docs? If yes - please share a link for reference. – Rocckk Jan 13 '21 at 12:36
  • @Rock escape sequences are part of the [string literal reference](https://docs.python.org/3/reference/lexical_analysis.html#string-and-bytes-literals). – Martijn Pieters Jan 13 '21 at 23:26
  • What if a string has all these mixed? – Vishal Kumar Sahu Jan 17 '21 at 11:14
  • @VishalKumarSahu: The string representation is consistent, and picks the best option for each codepoint in the string. You could have tried this out, of course. :-) – Martijn Pieters Jan 17 '21 at 14:20
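The fixed-width rule and the mixed-string question from the comments above can be tried out with a short sketch. It uses `ast.literal_eval` to parse literals given as source text; the variable names are only illustrative:

```python
import ast

# \u always consumes exactly 4 hex digits, so this literal is read as
# U+A3DA followed by "rling", not as a pound sign followed by "darling".
parsed = ast.literal_eval(r"'Hello \ua3darling'")
print(hex(ord(parsed[6])), parsed[7:])

# A \u escape with fewer than 4 hex digits is rejected by the parser.
try:
    ast.literal_eval(r"'\ua3'")
    short_form_accepted = True
except SyntaxError:
    short_form_accepted = False
print(short_form_accepted)

# In a mixed string, repr() chooses an escape per code point and leaves
# printable characters such as the pound sign unescaped.
mixed = "A\n\x80\ue000\U0010ffff\u00a3"
mixed_repr = repr(mixed)
print(mixed_repr)
```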