1

From the Python docs:

Bytes literals are always prefixed with 'b' or 'B'; they produce an instance of the bytes type instead of the str type. They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

I know that I can create a bytes object with a b-prefixed expression like b'cool'; this will convert the Unicode string 'cool' into bytes. I'm also aware that a bytes instance can be created with the bytes() function, but you need to specify the encoding argument: bytes('cool', 'utf-8').

From my understanding, I need to use one of the encoding rules if I want to translate a string into a sequence of bytes. I have done some experiments, and it seems the b prefix converts a string into bytes using UTF-8 encoding:

>>> a = bytes('a', 'utf-8')
>>> b'a' == a
True
>>> b = bytes('a', 'utf-16')
>>> b'a' == b
False

My question is: when creating a bytes object through the b prefix, what encoding does Python use? Is there any documentation that specifies this? Does it use UTF-8 or ASCII by default?

oeter
  • 1
    try `a = b'א'` for a sec :D – Adam.Er8 Aug 02 '20 at 09:20
  • 3
    It doesn't impose any encoding. It's up to the programmer to escape any non-ASCII bytes (as stated in your quote from the documentation). – ekhumoro Aug 02 '20 at 09:21
  • 1
    @ekhumoro So the `b` prefix actually doesn't translate a string into bytes; it just takes the corresponding code points of the ASCII characters in the string as bytes data, because the first 128 characters of Unicode correspond one-to-one with ASCII. Am I understanding it right? – oeter Aug 02 '20 at 13:39
  • @Adam.Er8 I am aware that a `b`-prefixed string can only contain ASCII characters, but I am still confused. – oeter Aug 02 '20 at 13:42
  • @oeterleonard Yes, it's a *bytes literal*, so no conversion takes place. You have to type out the representation of the actual bytes, rather than the text characters. Conveniently, ASCII provides a one-to-one mapping of the first 128 bytes, but beyond that, escapes are required. – ekhumoro Aug 02 '20 at 16:29

2 Answers

3

The bytes type can hold arbitrary data. For example, (the beginning of) a JPEG image:

>>> with open('Bilder/19/01/IMG_3388.JPG', 'rb') as f:
...     head = f.read(10)

You should think of it as a sequence of integers. That's also how the type behaves in many aspects:

>>> list(head)
[255, 216, 255, 225, 111, 254, 69, 120, 105, 102]
>>> head[0]
255
>>> sum(head)
1712
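The same correspondence works in the other direction. As a minimal sketch (the four values are simply the first bytes of the listing above, and the variable names are illustrative), a bytes object can be built directly from integers, with no text encoding involved:

```python
# Build a bytes object from plain integers in the range 0..255.
values = [255, 216, 255, 225]
from_ints = bytes(values)

# The same four bytes written as a literal with \xNN escapes.
from_literal = b'\xff\xd8\xff\xe1'

print(from_ints == from_literal)  # True
print(from_ints[1])               # 216: indexing yields an int
print(from_ints[1:2])             # b'\xd8': slicing yields bytes
```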

For reasons of convenience (and for historical reasons, I guess), the standard representation of bytes objects, and their literal syntax, is similar to that of strings:

>>> head
b'\xff\xd8\xff\xe1o\xfeExif'

It uses ASCII printable characters where applicable, \xNN escapes otherwise. This is convenient if the bytes object represents text:

>>> 'Zoë'.encode('utf8')
b'Zo\xc3\xab'
>>> 'Zoë'.encode('utf16')
b'\xff\xfeZ\x00o\x00\xeb\x00'
>>> 'Zoë'.encode('latin1')
b'Zo\xeb'

When you type a bytes literal, Python maps each ASCII character to the byte with the same numeric value; no other encoding is involved. Characters in the ASCII range are encoded the same way in UTF-8, which is why you observed that b'a' == bytes('a', 'utf8') is true. A less misleading comparison might be b'a' == bytes('a', 'ascii').
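A quick way to check this, as a sketch that uses only built-in operations (the variable names are illustrative): an ASCII character in a bytes literal stands for the byte with that character's ASCII code, exactly as the corresponding \xNN escape does.

```python
# An ASCII character in a bytes literal is the byte with its ASCII code.
literal = b'a'
print(literal[0] == ord('a') == 97)     # True
print(literal == b'\x61')               # True: 0x61 is 97
print(literal == 'a'.encode('ascii'))   # True
print(literal == 'a'.encode('utf-8'))   # True: UTF-8 matches ASCII below 128

# Above 127 the encodings diverge, so a literal needs explicit escapes.
latin1_bytes = 'ë'.encode('latin-1')
utf8_bytes = 'ë'.encode('utf-8')
print(latin1_bytes)                     # b'\xeb'
print(utf8_bytes)                       # b'\xc3\xab'
```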

lenz
  • Thank you for making this clear, your explanation really helps me understand better. "It uses ASCII printable characters where applicable, \xNN escapes otherwise." So actually the `b` prefix doesn't "translate" a string into bytes, it just represents a byte value in the form of an ASCII character if its numeric value is under 128. E.g. `b'Z'` is equal to `b'\x5a'`. Am I understanding it right? – oeter Aug 02 '20 at 16:15
  • 1
    Exactly. You can use `\xNN` escapes for the whole 0..255 range. You can think of the ASCII characters as some kind of pretty-printing proxy for display, or as a short cut for typing. – lenz Aug 02 '20 at 16:21
0

In short, it uses ASCII.

For example, suppose you want to save the words hello and hellõ to a file using only b-strings. You could try this:

with open("file.txt", "wb") as f:
    f.write(b'hello')  # substitute each bytes literal discussed below

b'hello': No problem.

b'hellõ': SyntaxError: bytes can only contain ASCII literal characters.

That's because, no matter what content you're trying to write with a b-string, the literal itself must be written using ASCII characters only. The docs you quoted mention:

They may only contain ASCII characters; bytes with a numeric value of 128 or greater must be expressed with escapes.

Bytes with numeric values of 128 or greater are not ASCII.
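The error can be reproduced without a source file. The sketch below passes source text to the built-in compile() and eval(); the file name "<example>" is an arbitrary placeholder, and the variable names are illustrative.

```python
# Source text for a bytes literal containing a raw non-ASCII character.
bad_source = "b'hell\u00f5'"  # the text b'hellõ'
try:
    compile(bad_source, "<example>", "eval")
    rejected = False
except SyntaxError as exc:
    rejected = True
    print("SyntaxError:", exc.msg)

# The same byte written as an ASCII-only escape is accepted.
good_source = r"b'hell\xf5'"
value = eval(good_source)
print(rejected, list(value))
```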

So, if you want to create a b-string that contains the weirdest of characters, you need to escape them with a sequence of ASCII characters that represents the hex values of the bytes.

For example, using only b-strings, writing 'hellõ' to a file requires you to find a representation in hexadecimal, where "find" means to choose an encoding. Below are a few examples of encoding this string.

b'hell\xc3\xb5'  # utf-8
b'hell\xf5'      # latin-1
b'hell\x9b'      # mac_latin2
b'hell\xe4'      # cp775

Notice that whichever encoding you choose, every one of these literals is written using only ASCII characters. What matters now is how you're going to decode the bytes.
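To illustrate that last point, here is a sketch that decodes the literals listed above. It assumes the codec names from the comments are available in your Python installation, and it also shows what happens when the wrong codec is picked.

```python
# Each literal from above, keyed by the codec it was encoded with.
encoded = {
    'utf-8': b'hell\xc3\xb5',
    'latin-1': b'hell\xf5',
    'mac_latin2': b'hell\x9b',
    'cp775': b'hell\xe4',
}

# Decode every byte string with its matching codec and show the text.
decoded = {codec: data.decode(codec) for codec, data in encoded.items()}
for codec, text in decoded.items():
    print(codec, text)

# A mismatched codec gives different text, or fails outright.
mojibake = encoded['utf-8'].decode('latin-1')
print(mojibake)
try:
    encoded['latin-1'].decode('utf-8')
    failed = False
except UnicodeDecodeError:
    failed = True
print(failed)
```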