64

What exactly is a "bytestring" in Python? What is the bytes type, and how does it work internally?

My understanding is that there are normal "ASCII strings", which store a sequence of "characters" which are "ASCII values" ranging from 0-255 inclusive, and each number represents a character. Similarly, I understand that Unicode uses either an 8-bit or 16-bit representation for each character.

To give a clearer example: suppose I do

>>> 'a'.encode()
b'a'

Okay; the result is a bytes that stores one byte.

However, I was told that bytes represents an immutable sequence of bytes with no particular interpretation. So... why can I read the "a"?

If I use the command line to see the ASCII value of the character:

$ printf "%d\n" "'a"
97

This makes some sense. If we interpret the number 97 as ASCII, then we get the letter a. Similarly, that value in binary - extended to 8 bits - would look like 01100001.

So why does 'a'.encode() look like b'a' instead of b'97', or b'01100001' (the underlying bit pattern)? Why does it look the same as if it were being interpreted like ASCII?

For that matter, if I write a bytes to a file opened in binary mode:

with open('testbytestring.txt', 'wb') as f:
    f.write(b'helloworld')

I still see the human-readable text helloworld in the file! Why is that?

Karl Knechtel
antimatter
  • When you say "decode to `utf-8`" what you really mean is to use the `decode("utf-8")` method which actually tells Python to interpret your UTF-8 bytes and return a `unicode`. Besides that example (and various wrappers on it like [codecs](https://bugs.python.org/issue8260)) *Python* will not interpret your bytes; you can send the characters to the terminal/file/socket/device etc. to have that consumer (not Python) do the interpretation. – personal_cloud Sep 28 '17 at 07:38

4 Answers

41

It is a common misconception that text is ASCII or UTF-8 or Windows-1252, and therefore bytes are text.

Text is only text, in the way that images are only images. The matter of storing text or images to disk is a matter of encoding that data into a sequence of bytes. There are many ways to encode images into bytes (JPEG, PNG, SVG), and likewise many ways to encode text (ASCII, UTF-8 or Windows-1252).

Once encoding has happened, bytes are just bytes. Bytes are not images anymore; they have forgotten the colors they mean, although an image format decoder can recover that information. Bytes have similarly forgotten the letters they used to be. In fact, bytes don't remember whether they were images or text at all. Only out-of-band knowledge (filename, media headers, etcetera) can suggest what those bytes should mean, and even that can be wrong (in case of data corruption).
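A minimal sketch of that point, with variable names of my own choosing: one character encodes to different bytes under different encodings, and the resulting bytes decode without complaint under the wrong assumption, just to different text.

```python
# The same text encoded three ways gives three different byte sequences.
text = "\u00e9"                        # the single character "é"
as_utf8 = text.encode("utf-8")         # two bytes
as_latin1 = text.encode("latin-1")     # one byte
as_utf16 = text.encode("utf-16-le")    # two bytes, but different ones

# The bytes carry no record of which encoding produced them.
# Decoding with the wrong assumption succeeds, but yields different text.
right = as_utf8.decode("utf-8")
wrong = as_utf8.decode("latin-1")

print(as_utf8, as_latin1, as_utf16)
print(right, wrong)
```

Nothing in `as_utf8` itself says "I am UTF-8"; the caller has to know that from somewhere else.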

So, in Python (Python 3), we have two types for things that might otherwise look similar. For text, we have str, which knows it's text; it knows which letters it's supposed to mean. It doesn't know which bytes that might be, since letters are not bytes. We also have bytestring (the bytes type), which doesn't know if it's text or images or any other kind of data.

The two types are superficially similar, since they are both sequences of things, but the things that they are sequences of are quite different.
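As a rough illustration (Python 3, example string chosen by me), iterating over each type shows what it is a sequence of:

```python
text = "h\u00e9llo"              # "héllo": five code points
data = text.encode("utf-8")      # the same text as UTF-8 bytes

# Iterating a str yields one-character strings (code points).
chars = list(text)
# Iterating bytes yields integers in range(256).
values = list(data)

print(len(text), chars)
print(len(data), values)
```

The lengths differ because "é" is one code point but two bytes in UTF-8.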

Implementation-wise, str is stored in memory as UCS-?, where the ? is implementation-defined: it may be UCS-4, UCS-2 or UCS-1, depending on compile-time options and on which code points are present in the represented string.
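One way to observe this, assuming CPython 3.3 or later (other implementations may behave differently, and the exact sizes are not guaranteed), is to compare the memory footprint of strings of equal length whose widest code point differs:

```python
import sys

# 1000 code points each; only the widest code point differs.
ascii_only = "a" * 1000           # every code point fits in 1 byte
bmp = "\u1234" * 1000             # needs 2 bytes per code point
astral = "\U0001F600" * 1000      # needs 4 bytes per code point

# Same length in code points, different storage size in memory.
for s in (ascii_only, bmp, astral):
    print(len(s), sys.getsizeof(s))
```

None of this is visible from the str API itself: all three strings have length 1000.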


"But why"?

Some things that look like text are actually defined in other terms. A really good example of this is the many Internet protocols of the world. For instance, HTTP is a "text" protocol that is in fact defined using the ABNF syntax common in RFCs. These protocols are expressed in terms of octets, not characters, although an informal encoding may also be suggested:

2.3. Terminal Values

Rules resolve into a string of terminal values, sometimes called characters. In ABNF, a character is merely a non-negative integer. In certain contexts, a specific mapping (encoding) of values into a character set (such as ASCII) will be specified.

This distinction is important, because it's not possible to send text over the internet; the only thing you can do is send bytes. Saying "text but in 'foo' encoding" makes the format that much more complex, since clients and servers now need to somehow figure out the encoding business on their own, hopefully in the same way, since they must ultimately pass data around as bytes anyway. This is doubly useless since these protocols are seldom about text handling anyway, and the text-like appearance is only a convenience for implementers. Neither the server owners nor end users are ever interested in reading the words Transfer-Encoding: chunked, so long as both the server and the browser understand it correctly.
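A small sketch of handling such a header purely as octets. The header line here is a made-up sample rather than output from a real server, and this is not a full HTTP parser:

```python
# A hypothetical raw header line, as it might arrive from a socket.
raw_line = b"Transfer-Encoding: chunked\r\n"

# Split and compare using bytes operations only; nothing is decoded to str.
name, _, value = raw_line.rstrip(b"\r\n").partition(b":")
name = name.strip().lower()
value = value.strip()

is_chunked = (name == b"transfer-encoding" and value == b"chunked")
print(name, value, is_chunked)
```

The program compares octets against octets; whether a human could read them as words never enters into it.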

By comparison, when working with text, you don't really care how it's encoded. You can express the "Heävy Mëtal Ümlaüts" any way you like, except "Heδvy Mλtal άmlaόts".
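The garbled spelling in that sentence looks like what happens when text is encoded with one single-byte encoding and decoded with another. A sketch, assuming Latin-1 on the way out and ISO 8859-7 (Greek) on the way back in; that choice of codecs is my guess, not something the answer states:

```python
original = "He\u00e4vy M\u00ebtal \u00dcmla\u00fcts"   # "Heävy Mëtal Ümlaüts"

# Encode with one single-byte encoding...
data = original.encode("latin-1")
# ...and decode the very same bytes with a different one.
garbled = data.decode("iso8859_7")

# The text survives only when encoder and decoder agree.
round_trip = data.decode("latin-1")

print(garbled)
print(round_trip)
```

The bytes in `data` are identical in both cases; only the interpretation changed.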


The distinct types thus give you a way to say "this value 'means' text" or "bytes".

Peter Mortensen
SingleNegationElimination
  • It **is** representing text, since if I type `b'Hello World'` into the interpreter, it returns `b'Hello World'`. How does it know that it's a character? From what I read in the docs, it represents ASCII characters 0-127, and everything else has an escape sequence. Why not just call it an ascii string? is it because .encode('ascii') is extended ascii (0-255)? is it so that you can represent many escape sequences? – antimatter Apr 02 '14 at 23:04
  • 2
    only a human may recognize `b'Hello World'` as text. On the other hand, `b'GIF89a\x01\x00\x01\x00\x80\x01\x00\xff\xff\xff\x00\x00\x00!\xf9\x04\x01\n\x00\x01\x00,\x00\x00\x00\x00\x01\x00\x01\x00\x00\x02\x02L\x01\x00;'` is not text at all. Neither can be decoded with the `utf16le` encoding expected by certain win32 api's. – SingleNegationElimination Apr 02 '14 at 23:17
  • Ok, but if you do `echo "hello" | od -bc` in the terminal, it will show you the integer representation for each index in the array representing the string "hello". In this case `150 145 154 154 157 012`. So why is `b'hello'` returning `b'hello'` instead of something like `150 145 154 154 157 012`? It seems like it's interpreting anything from 0-255* as ASCII. Am I wrong? – antimatter Apr 03 '14 at 00:06
  • 3
    let me repeat this, a `bytestring` *represents* an immutable sequence of bytes, without implying any particular interpretation, as text or otherwise, whereas `str` *represents* an immutable sequence of unicode codepoints, without implying any particular binary encoding. the fact that the python literals for each looks similar is only a convenience. – SingleNegationElimination Apr 03 '14 at 00:50
  • I still don't see how this works under the hood... A sequence of bytes, ok, but what is a byte? A byte is 8 bits. Say I do `printf "%d" "'a"`, returning `97`. 2's complement representation is `01100001` *(2^0+2^5+2^6=97)*. So why is `'a'.encode()` not returning `b'01100001'`? Instead it's interpreting the bytes to *represent* something... – antimatter Apr 03 '14 at 01:06
  • no, *you're* interpreting the bytes to represent something. bytes don't "mean" anything. – SingleNegationElimination Apr 03 '14 at 01:22
  • Is `a` a byte? Last I checked, it was a character. `01100001` are bits, one byte. Not sure what I'm interpreting, except that `b'a'` looks a hell of a lot like a character and not a byte – antimatter Apr 03 '14 at 02:06
  • 10
    The interpreter calls the magic `__repr__()` function to give you a readable representation of the bytestring. `__repr__()` is defined as returning a string, so it gives a possibly-meaningful-to-humans string by treating the bytestring as ASCII or UTF-8. That doesn't mean the underlying bytestring necessarily represents ASCII or looks like a string. The map is not the territory. – Russell Borogove Apr 03 '14 at 03:39
  • Actually, I'm not sure if this is the case either. Since you can write the bytestring to a file and it will *still* represent it as regular characters... – antimatter Apr 03 '14 at 07:26
  • "common misconception that text is ascii or utf8"???? So the 99% of software out there that stores AND processes text in UTF8 is all "misconceived"? No, it can't be merely "misconception"; it is more like "massive conspiracy". Or, in less inflammatory terms, how about we just call it "the normal world". – personal_cloud Oct 05 '17 at 15:39
  • Does this answer the question of why bytes are not presented as 0's & 1's? You say to the OP that "you're interpreting the bytes to represent something", but I think the OP's question clearly stated is "why is PYTHON interpreting the bytes to represent something?", i.e. printing the `bytes` as if there is some implied encoding, instead of as 0's & 1's. – Joe Jun 17 '23 at 13:08
31

Python does not know how to represent a bytestring. That's the point.

When you output a character with value 97 into pretty much any output window, you'll get the character 'a' but that's not part of the implementation; it's just a thing that happens to be locally true. If you want an encoding, you don't use bytestring. If you use bytestring, you don't have an encoding.
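To see that the stored value really is just the number, here is a short Python 3 sketch (variable names are mine). The `b'a'` form is only how `repr()` chooses to display the object; the same byte can equally be shown as a decimal number, a bit pattern or hex:

```python
data = bytes([97])             # one byte holding the value 97

value = data[0]                # the stored value, as an int
bits = format(value, "08b")    # the bit pattern, as text
hexed = data.hex()             # hexadecimal, as text

# repr() shows printable ASCII bytes as characters, others as \x escapes.
shown = repr(data)
shown_mixed = repr(bytes([97, 0, 255]))

print(value, bits, hexed, shown, shown_mixed)
```

`bytes([97]) == b'a'` is true: the literal is just a compact way of writing the same byte.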

Your piece about .txt files shows you have misunderstood what is happening. You see, plain text files too don't have an encoding. They're just a series of bytes. These bytes get translated into letters by the text editor but there is no guarantee at all that someone else opening your file will see the same thing as you if you stray outside the common set of ASCII characters.
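A sketch of the same effect from inside Python, using a scratch file created with `tempfile` (the path and sample word are arbitrary): the bytes on disk stay the same, and what you "see" depends on the encoding the reader assumes.

```python
import os
import tempfile

# Hypothetical scratch file in a fresh temporary directory.
path = os.path.join(tempfile.mkdtemp(), "example.txt")

with open(path, "wb") as f:
    f.write("f\u00fcr".encode("utf-8"))     # the word "für" as UTF-8 bytes

with open(path, "rb") as f:
    raw = f.read()                          # no interpretation at all
with open(path, encoding="utf-8") as f:
    as_utf8 = f.read()                      # reader assumes UTF-8
with open(path, encoding="latin-1") as f:
    as_latin1 = f.read()                    # reader assumes Latin-1

print(raw, as_utf8, as_latin1)
```

A text editor is in the same position as the last two `open()` calls: it has to pick an encoding, and the file does not tell it which.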

Jack Aidley
  • Thanks, this and Russell's answer cleared up the confusion for me. – antimatter Apr 03 '14 at 09:01
  • 3
    TLDR - The basic issue that was cleared up to me was that both text editors, the python interpreter (using `__repr__`), etc, interpret a bytestring in `ASCII` (assuming no encoding specified) to potentially represent something meaningful to the user. – antimatter Mar 12 '16 at 10:43
  • Actually, text editors can be pretty liberal in how they interpret text. Some assume UTF-8 by default (which is a super-set of 7-bit ASCII); many use heuristics to guess. – jpaugh Jan 29 '18 at 20:01
  • 1
    @jpaugh There is no *guarantee* that a text editor will default to UTF-8, but it's a reasonable assumption, just like assuming that a web browser will default to JavaScript or even HTML5. Standardization is progress. As for the heuristics that you mention, they are designed to (by default) not interfere with UTF-8. – personal_cloud Jul 26 '19 at 20:50
8

As the name implies, a Python 3 bytestring (or simply a str in Python 2.7) is a string of bytes. And, as others have pointed out, it is immutable.

It is distinct from a Python 3 str (or, more descriptively, a unicode in Python 2.7) which is a string of abstract Unicode characters (a.k.a. UTF-32, though Python 3 adds fancy compression under the hood to reduce the actual memory footprint similar to UTF-8, perhaps even in a more general way).

There are essentially three ways of "interpreting" these bytes. You can look at the numeric value of an element, like this:

>>> ord(b'Hello'[0])  # Python 2.7 str
72
>>> b'Hello'[0]  # Python 3 bytestring
72

Or you can tell Python to emit one or more elements to the terminal (or a file, device, socket, etc.) as 8-bit characters, like this:

>>> print b'Hello'[0] # Python 2.7 str
H
>>> import sys
>>> sys.stdout.buffer.write(b'Hello'[0:1]) and None; print() # Python 3 bytestring
H

As Jack hinted at, in this latter case it is your terminal interpreting the character, not Python.

Finally, as you have seen in your own research, you can also get Python to interpret a bytestring. For example, you can construct an abstract unicode object like this in Python 2.7:

>>> u1234 = unicode(b'\xe1\x88\xb4', 'utf-8')
>>> print u1234.encode('utf-8') # if terminal supports UTF-8
ሴ
>>> u1234
u'\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<type 'unicode'>
>>> len(u1234)
1
>>>

Or like this in Python 3:

>>> u1234 = str(b'\xe1\x88\xb4', 'utf-8')
>>> print (u1234) # if terminal supports UTF-8 AND python auto-infers
ሴ
>>> u1234.encode('unicode-escape')
b'\\u1234'
>>> print ('%04x' % ord(u1234))
1234
>>> type(u1234)
<class 'str'>
>>> len(u1234)
1

(And I am sure that the amount of syntax churn between Python 2.7 and Python 3 around bytestrings, strings, and Unicode had something to do with the continued popularity of Python 2.7. I suppose that when Python 3 was invented they didn't yet realize that everything would become UTF-8 and therefore all the fuss about abstraction was unnecessary.)

But the Unicode abstraction does not happen automatically if you don't want it to. The point of a bytestring is that you can directly get at the bytes. Even if your string happens to be a UTF-8 sequence, you can still access bytes in the sequence:

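>>> len(b'\xe1\x88\xb4')
3
>>> b'\xe1\x88\xb4'[0]  # Python 2.7 str
'\xe1'
>>> b'\xe1\x88\xb4'[0]  # Python 3 bytestring
225

And this works in both Python 2.7 and Python 3, with the difference being that in Python 2.7 you have str (so indexing gives a one-character str), while in Python 3 you have bytestring (so indexing gives an int).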

You can also do other wonderful things with bytestrings, like knowing if they will fit in a reserved space within a file, sending them directly over a socket, calculating the HTTP content-length field correctly, and avoiding Python Bug 8260. In short, use bytestrings when your data is processed and stored in bytes.
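For the content-length case, a minimal sketch (the body text is an arbitrary example and this is not a complete HTTP response): the character count and the byte count diverge as soon as non-ASCII text appears, and the header needs the byte count.

```python
body_text = "He\u00e4vy M\u00ebtal"       # "Heävy Mëtal"

# Characters and bytes are counted differently for non-ASCII text.
body_bytes = body_text.encode("utf-8")
char_count = len(body_text)
byte_count = len(body_bytes)

# Content-Length must be the number of bytes actually sent.
header = b"Content-Length: " + str(byte_count).encode("ascii") + b"\r\n"
print(char_count, byte_count, header)
```

Using `len(body_text)` here would under-report the length by two bytes.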

Peter Mortensen
personal_cloud
1

Bytes objects are immutable sequences of single bytes. The documentation has a very good explanation of what they are and how to use them.
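A quick sketch of what "immutable" means in practice (Python 3): item assignment on a bytes object raises `TypeError`, while `bytearray` is the mutable counterpart.

```python
data = b"hello"

# bytes does not support item assignment.
try:
    data[0] = 72
    mutated = True
except TypeError:
    mutated = False

# bytearray holds the same kind of values but can be changed in place.
buf = bytearray(data)
buf[0] = 72                # 72 is the byte value of "H" in ASCII
frozen = bytes(buf)        # convert back to an immutable bytes object

print(mutated, buf, frozen)
```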

Peter Mortensen
MattDMo
  • 1
    Ok, so it says "Only ASCII characters are permitted in bytes literals (regardless of the declared source code encoding). Any binary values over 127 must be entered into bytes literals using the appropriate escape sequence.". So what's the point of having them rather than using ASCII? is it for compatibility purposes where something can't read extended ascii (0-255)? – antimatter Apr 02 '14 at 22:58
  • 1
    @gyeh What exactly is extended ASCII? Especially, what does the character 252 mean in your so-called "extended ASCII"? Is it `ü`, as in `latin1`? Or is it `³` as in `cp850`? Or `ⁿ` (`cp437`)? There are many options. So the mapping from bytes to characters depends on the encoding. And that's why both strings and bytestrings exist: strings hold characters, whose "byte representation" depends on the encoding used, and bytestrings hold bytes, whose "character meaning" depends on the encoding. – glglgl Apr 02 '14 at 23:17
  • 1
    I understand that extended ascii requires an encoding. I think my point is being missed here. If I do `echo "hello" | od -bc`, it will show the integer value for each index of the array representing hello as a string. In this case: `150 145 154 154 157 012`. So why is `b'hello'` human readable instead of those stream of numbers? It seems like it's doing ASCII representation to me. – antimatter Apr 03 '14 at 00:04