3

Could there be a Python implementation where one of the following assertions fails?

assert all(byte in range(256) for byte in any_bytes_object) # Python 3 semantics 
assert all(byte in range(256) for byte in map(ord, any_bytes_object)) # Python 2

POSIX specifies explicitly that CHAR_BIT == 8 (8 bits per byte). Is there a similar guarantee in Python? Is it documented somewhere?

Python 2 reference says: "Characters represent (at least) 8-bit bytes."

If the name bytes is not defined (on old Python versions, e.g., Jython 2.5), then the question is about the str type (bytestrings); that is, bytes = str on Python 2.
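A minimal sketch of a version-independent form of the check above. The helper names `byte_values` and `all_octets` are hypothetical (not from any library), and the sketch only covers bytes/str bytestrings, not bytearray:

```python
import sys


def byte_values(data):
    """Return an iterator over the items of a bytestring as integers."""
    if sys.version_info[0] >= 3:
        # Python 3: iterating over bytes already yields integers.
        return iter(data)
    # Python 2: iterating over str yields 1-character strings, so map through ord().
    return (ord(ch) for ch in data)


def all_octets(data):
    """Return True if every item of the bytestring lies in range(256)."""
    return all(0 <= value < 256 for value in byte_values(data))
```

The question is then whether `all_octets(any_bytes_object)` could ever be false on some implementation.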

jfs
  • I was gonna nitpick (a str is not bytes in Python 3) ... but then I saw it was you... I'm pretty sure you know this and really mean a bytes object – Joran Beasley Mar 28 '16 at 15:47
  • @JoranBeasley: I don't see `str` being mentioned in the question. I've updated the question to show both Python 3 and Python 2 code examples. – jfs Mar 28 '16 at 15:50
  • While in C the terms `char` and `byte` are basically synonymous, POSIX originally made no claims about the number of bits in a byte. The C99 spec in a roundabout way required that a byte be 8-bits (by constructing an obtuse set of requirements whose only solution was an 8-bit byte, without actually stating it explicitly), and as such The Open Group has in modern POSIX standards (2001-now) mandated that bytes are 8 bits (rationale here: http://pubs.opengroup.org/onlinepubs/009695399/xrat/xbd_chap03.html). All that being said, the Python language reference does require that bytes be 8-bits. – Nick Bastin Mar 28 '16 at 16:03
  • @NickBastin: 1- I don't see where the Python language reference (Python 2) requires that bytes be **exactly** 8 bits. The Python 2 spec says [**at least**](http://stackoverflow.com/questions/36265726/are-bytes-always-octets-in-python/36266136#comment60160415_36266136). 2- I don't understand how it is relevant to this question why POSIX has chosen `CHAR_BIT == 8` (`sizeof(char) == 1` i.e., char is byte here). 3- In principle, Python's definition of what a `byte` is may differ from the `byte` definition used by other parts of the system such as the JVM, C compiler, or CPU, though they are in agreement in practice. – jfs Mar 28 '16 at 16:52
  • @J.F.Sebastian: The python spec for 3.6 says exactly 8 bits: https://docs.python.org/3.6/reference/datamodel.html#objects-values-and-types Also the POSIX question is relevant because your question implicitly claims that CHAR_BIT is a byte (when you say `8 bits per byte)`, but that is not true. CHAR_BIT and byte have no specified relationship. – Nick Bastin Mar 28 '16 at 17:47
  • @NickBastin 1- my question is not limited to Python 3 e.g., Python 2 implementations such as jython, pypy are mentioned explicitly 2- do you think it is possible that the number of bits in a byte is different from CHAR_BIT? – jfs Mar 28 '16 at 17:56
  • @J.F.Sebastian: Yes there are absolutely platforms where the number of bits in a character is not the number of bits in a byte (because characters are more likely to move over the network and there should be agreement, which even say DSP platforms with strange byte sizes will respect in the compiler for a char - presuming modern toolchains that have been updated for C99 compliance, which is generally true now, but not say 5 years ago even). – Nick Bastin Mar 28 '16 at 17:58
  • @NickBastin the network is defined (usually) in terms of octets (not bytes, not characters). I understand a DSP platform where `CHAR_BIT != 8` but `sizeof(char) == 1` i.e., the number of bits in a char is the number of bits in a byte anyway. Are you saying there are platforms where `sizeof(char)!=1`? – jfs Mar 28 '16 at 18:24
  • @J.F.Sebastian: First, `CHAR_BIT` is always 8, because that is what the standard says. But also, `sizeof char == 1` always, because the standard also makes that a requirement (since long before C99), regardless of implementation (the underlying implementation may do whatever it wants, but the compiler *must* report that `sizeof char` is `1`). My point is that neither of those types are guaranteed to be a byte from the perspective of your architecture. (move to chat failed for some reason, so still posting here) – Nick Bastin Mar 28 '16 at 21:49
  • @NickBastin 1- **wrong**. [`CHAR_BIT` is not always `8`](http://stackoverflow.com/q/2098149) 2- yes, `sizeof(char)==1` (it is 3rd time I'm mentioning it) and therefore `char` is byte here (as I said already). You have failed to provide an example when char is not byte. – jfs Mar 28 '16 at 22:32
  • @J.F.Sebastian: CHAR_BIT is mandated by POSIX spec to be 8. If it's not 8, then it's not POSIX compliant. This is certainly possible, but seems out of the scope here. Also just because `sizeof char == 1` doesn't mean that's a *byte*, and it doesn't mean that the storage unit for `char` is actually `CHAR_BIT` (it just needs to *appear* to be, there is no requirement for it to *actually* be `CHAR_BIT` internally). I don't really know what you're trying to say here - my only point is and has been that `CHAR_BIT` has no predictable relationship to the byte size of an architecture. – Nick Bastin Mar 29 '16 at 02:15
  • @NickBastin 1- You know that I know that POSIX specifies `CHAR_BIT == 8`; it is said explicitly in the question. POSIX in my question is an example of a specification that defines the `CHAR_BIT` value explicitly. It doesn't mean that the question itself is limited to POSIX. I agree that these comments are out of scope; as I said in the reply to your very first comment, I don't see how this discussion is relevant to the question. 2- I have found an explicit reference in C99: *"A byte contains `CHAR_BIT` bits"*, that is, the assertion in the question stands: `CHAR_BIT==8` implies 8-bit bytes. – jfs Mar 29 '16 at 03:14
  • @J.F.Sebastian: As I said in my 2nd comment, and will repeat now, my only problem is with this statement in your question: "POSIX specifies explicitly that CHAR_BIT == 8 (8 bits per byte)" This implies that POSIX specifies `CHAR_BIT` as a byte and *this is not true*. – Nick Bastin Mar 29 '16 at 03:40
  • @NickBastin: this discussion goes nowhere. What does it even mean: *"POSIX specifies `CHAR_BIT` as a byte"*. C standard specifies that `CHAR_BIT` is the number of bits in a byte therefore POSIX also specifies it. POSIX (in addition) specifies that `CHAR_BIT == 8` and POSIX also says explicitly:[*"`CHAR_BIT` Number of bits in a type `char`."*](http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/limits.h.html) However you slice it: the number of bits in a byte is equal to the number of bits in `char`. – jfs Mar 29 '16 at 03:56
  • @J.F.Sebastian: No it isn't! The number of bits in a byte on a given architecture is architecture dependent! The number of bits in a char is always 8. Bits in `char` != bits in byte for many architectures. – Nick Bastin Mar 29 '16 at 03:58
  • @NickBastin: I've asked a separate question (with self-answer): [Is the number of bits in a byte equal to the number of bits in a type char?](http://stackoverflow.com/q/36289134/4279) – jfs Mar 29 '16 at 15:57

3 Answers

5

The Python 3 documentation for the bytes object says:

bytes objects actually behave like immutable sequences of integers, with each value in the sequence restricted such that 0 <= x < 256

And the bytearray type is documented in both Python 3 and Python 2 as

a mutable sequence of integers in the range 0 <= x < 256

so the language is designed under the assumption of 8-bit bytes.
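As a quick illustration on Python 3 (a sketch, not part of the documentation; the helper name `accepts` is made up for this example), both constructors reject integers outside that range:

```python
def accepts(factory, value):
    """Return True if factory([value]) succeeds, False if it raises ValueError."""
    try:
        factory([value])
    except ValueError:
        return False
    return True


# Probe both types with values at and just beyond the documented limits.
results = {
    (factory.__name__, value): accepts(factory, value)
    for factory in (bytes, bytearray)
    for value in (-1, 0, 255, 256)
}
```

Here `results[("bytes", 255)]` is true while `results[("bytes", 256)]` is false, and the same holds for bytearray.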


The Python 2 data model section saying "at least" 8 bits seems to just be one of the places where the Python 2 documentation hasn't been kept up to date very well compared to the Python 3 documentation. It dates back to at least Python 1.4, back in the really early days when they weren't sure whether they might want to support weird byte sizes.

Since at least the introduction of unicode support in the 2.0 release, the documentation has been full of places referring to the bytestring type as "8-bit strings". Python isn't as rigorously specified as something like C, but I'd say any "conforming" implementation of Python 2.0 or up would have to have 8-bit bytes.

user2357112
  • Could you comment on the ["at least" part in Python 2 reference](http://stackoverflow.com/a/36267980/4279)? – jfs Mar 29 '16 at 14:55
  • @J.F.Sebastian: I'd say that's something they just never bothered to change in the Python 2 docs. It's at least as old as the [1.4 documentation](https://docs.python.org/release/1.4/ref/ref3.html#REF19057). There are a lot of places where the Python 2 documentation hasn't been kept up to date as well as the Python 3 docs. Plenty of other places in the Python 2 docs talk about "8-bit strings", apparently because they suddenly had to disambiguate them from unicode strings when 2.0 introduced unicode support. – user2357112 Mar 30 '16 at 02:33
  • Python isn't as rigorously specified as a language like C, but I'd say any "conforming" implementation of a Python version 2.0 or up would have to have 8-bit bytestrings. – user2357112 Mar 30 '16 at 02:37
  • Yes, [8-bit is enforced in CPython since 2003 (at least)](https://hg.python.org/cpython/rev/dfd0114cf9ec/). Python 3 docs formalize it. Every other place in Python 2 docs mentions only 8-bit strings. I don't know of any Python 2 implementation that does not use 8-bit strings; therefore the answer is: "bytes are always octets in Python". Could you incorporate your comment into the answer? – jfs Apr 02 '16 at 22:52
  • @J.F.Sebastian: My comments have been incorporated into the answer. – user2357112 Apr 02 '16 at 23:13
1

In addition to the official documentation cited by user2357112, we can consult the Python enhancement proposals that introduced the bytes object.

PEP 358 -- The "bytes" Object specified:

A bytes object stores a mutable sequence of integers that are in the range 0 to 255.

Since bytes objects ended up immutable, this specification cannot be fully applicable, and the 'range' part of it might be moot, too.

Interestingly, PEP 3137 -- Immutable Bytes and Mutable Buffer, which partially superseded PEP 358 (and specifies bytes as immutable and introduces bytearrays as the mutable equivalent) only specifies what you can put into bytes objects and into bytearrays ("int[eger]s in range(256)"), but not what may come out of them.

Neither PEP mentions "bit" or "bits" at all. (Though we know from bitwise boolean operations how Python integers map to bit patterns, so there shouldn't be any surprises there, I hope.)
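To make the link between range(256) and 8 bits concrete, here is a small sketch using `int.bit_length()` and a bitwise mask (the variable names are only illustrative):

```python
# Number of bits needed for each value that the PEPs allow in a bytes object.
widths = [value.bit_length() for value in range(256)]

max_width = max(widths)              # widest value in range(256)
first_too_wide = (256).bit_length()  # the first excluded value needs one more bit

# Masking with the low 8 bits leaves every permitted value unchanged.
masked_unchanged = all(value & 0xFF == value for value in range(256))
```

So "integers in range(256)" and "values that fit in 8 bits" describe the same set, even though neither PEP words it that way.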

Community
das-g
  • https://docs.python.org/3.6/reference/datamodel.html#objects-values-and-types The section under immutable sequences explicitly states that bytes are 8-bits. – Nick Bastin Mar 28 '16 at 16:13
  • @NickBastin Good find. That should probably be an answer in its own right, quoting the relevant section of that page! – das-g Mar 28 '16 at 16:15
  • @NickBastin: [Python 2 reference says: _"Characters represent (**at least**) 8-bit bytes."_](https://docs.python.org/2/reference/datamodel.html#index-20) – jfs Mar 28 '16 at 16:23
  • @das-g: `bytes` is an *immutable* sequence in Python and therefore PEP 358 that says differently can't be used as a reference. – jfs Mar 28 '16 at 16:27
  • @J.F.Sebastian re (im)mutability: You're right. I've edited the question to account for that. – das-g Mar 28 '16 at 17:08
  • Re "That should probably be an answer in its own right [...]": As @NickBastin hasn't posted one so far, I've tried to put this into a [community wiki answer](http://stackoverflow.com/a/36267980/674064). Feel free to contribute. :-) – das-g Mar 28 '16 at 18:01
1

Python 3

Since Python 3.0, the Python language reference specifies:

A bytes object is an immutable array. The items are 8-bit bytes, represented by integers in the range 0 <= x < 256.
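A short sketch of what this looks like in practice on Python 3 (the helper name `fits_in_one_byte` is hypothetical): one bytes object can hold every permitted value exactly once, and an integer outside the range cannot be packed into a single item.

```python
# One item for every value the reference permits; indexing yields plain ints.
all_values = bytes(range(256))


def fits_in_one_byte(value):
    """Return True if the integer can be encoded as a single unsigned byte."""
    try:
        value.to_bytes(1, "big")
    except OverflowError:
        return False
    return True
```

With this, `fits_in_one_byte(255)` is true and `fits_in_one_byte(256)` is false, matching the 0 <= x < 256 bound quoted above.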

Python 2

Before that (i.e., up to Python 2.7), it specified (as already mentioned in the question):

The items of a string are characters. […] Characters represent (at least) 8-bit bytes.

(Emphasis added.)

Note that Python 2 did not have a separate bytes type (since Python 2.6, the name bytes is only an alias for str). To hold immutable sequences of byte-chunked binary data in Python 2, strings were/are usually used. (Python 3 strings, in contrast, are meant for textual data only and correspond more closely to Python 2's unicode objects than to Python 2 strings.)

but ...

Python 2 documentation of the ord() function mentions "8-bit strings" and contrasts them to unicode objects. It might be implied that all non-unicode Python-2 strings are 8-bit strings, but I wouldn't count on that.

Conclusion

A Python implementation that provides Python-3-compliant bytes objects is restricted to holding only 8-bit bytes in them. A Python implementation compliant with Python 2 would not be bound by this (a bytes object, if it features one, would be unspecified), and if you used its Python-2-compliant strings as a substitute, there wouldn't be any guarantees about the maximum byte size (actually, character size) either, unless the implementation states some guarantees of its own.

das-g
  • `bytes is str` on Python 2. If `bytes` is not defined (old Python versions) then the question is about `str` type (bytestrings). – jfs Mar 28 '16 at 18:06