1

From what I understand, a Python3 string is a sequence of bytes that has been decoded to be readable by humans, and a Python3 bytes object are the raw bytes that are not human readable. What I'm having trouble understanding, however, is how strings encoded with UTF-8 or ASCII are displayed as a string prefixed by a b, rather than a sequence of bytes

string = "I am a string"

# prints a sequence of bytes, like I would expect
string.encode("UTF-16")
b'\xff\xfeI\x00 \x00a\x00m\x00 \x00a\x00 \x00s\x00t\x00r\x00i\x00n\x00g\x00'


# Prints a sequence of human readable characters, which I don't understand
string.encode("UTF-8")
b'I am a string'

Why does a string encoded by UTF-8 or ASCII not display a sequence of bytes?

much2learn
  • 151
  • 1
  • 2
  • 7
  • "Why does a string encoded by UTF-8 or ASCII not **display** a sequence of bytes?" - The hint is in the question. What is being "displayed" *HAS* to be a design choice, and well, would you rather look at byte strings as numbers or bytestrings that we can understand. :P – Paritosh Singh Aug 24 '19 at 15:33

1 Answers1

3

UTF-8 is a backwards-compatible superset of ASCII, i.e. anything that’s valid ASCII is valid UTF-8 and everything present in ASCII is encoded by UTF-8 using the same byte as ASCII. So it’s not “UTF-8 or ASCII” so much as “just some of ASCII”. Try other Unicode:

>>> "café".encode("UTF-8")
b'caf\xc3\xa9'

or other ASCII that wouldn’t be very helpful to look at in character form:

>>> "hello\f\n\t\r\v\0\N{SOH}\N{DEL}".encode("UTF-8")
b'hello\x0c\n\t\r\x0b\x00\x01\x7f'

The reason the repr of bytes displays printable characters instead of \xnn escapes when possible is because it’s helpful if you do happen to have bytes that contain ASCII.

And, of course, it’s still a well-formed bytes literal:

>>> b'I am a string'[0]
73

Additionally: From the docs

While bytes literals and representations are based on ASCII text, bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256 (attempts to violate this restriction will trigger ValueError). This is done deliberately to emphasise that while many binary formats include ASCII based elements and can be usefully manipulated with some text-oriented algorithms, this is not generally the case for arbitrary binary data

-emphasis added.

At the end of the day, this is a design choice that python made for displaying bytes.

Paritosh Singh
  • 6,034
  • 2
  • 14
  • 33
Ry-
  • 218,210
  • 55
  • 464
  • 476