
There's a note in the POSIX rationale that mandating `CHAR_BIT == 8` was a concession necessary to maintain alignment with C99 without throwing out sockets/networking, but I've never seen an explanation of what exactly the conflict was. Does anyone have anecdotes or citations for why it was deemed necessary?

Edit: I've gotten a lot of speculative answers regarding why it's desirable for CHAR_BIT to be 8, and I agree, but what I'm really looking for is what the technical conflict between C99 and the networking stuff in POSIX is. My best guess is that it has something to do with C99 requiring uint*_t to be exact-sized types (no padding) whereas the inttypes.h previously in POSIX made no such requirement.

R.. GitHub STOP HELPING ICE
  • Is "because a lot of code would break otherwise" a good answer? – user541686 Jul 08 '11 at 23:14
  • because we are used to that!! – Ulterior Jul 08 '11 at 23:19
  • @user: There are lots of things POSIX does which run counter to what "most programmers are used to" -- e.g. `fork`. When you learn `fork` it does nothing like anything you've ever seen before. However, it's a core of the Unix process manipulation model. – Billy ONeal Jul 08 '11 at 23:27
  • @Billy ONeal I did forks, not related to the question – Ulterior Jul 08 '11 at 23:29
  • @user: I never said it was related to the question. I'm saying "Because people are used to that" isn't exactly the best reason for making it that way. People get used to new things. – Billy ONeal Jul 08 '11 at 23:31
  • @R..: I think you might have answered your own question there. If prior versions of Posix *required* that `uint8_t` exists, but permit it to have padding, and then C99 comes along and doesn't require `uint8_t` to exist, but says that if it does then it must not have padding, Posix has two choices if it is to incorporate C99 - un-require that `uint8_t` exists (which renders programs that were valid, invalid), or else require that it has no padding, (which renders implementations that were conforming, non-conforming). The latter may well be the lesser evil. – Steve Jessop Jul 08 '11 at 23:53
  • @Steve Jessop: The latter is certainly the lesser evil, since such implementations at least remaining conforming to earlier revisions of POSIX. – caf Jul 09 '11 at 01:57
  • @Steve: Wow, I think it probably is that simple. I hadn't thought of the fact that `uint8_t` was probably already required to exist, just that it might have been used in the specification of some other structures, which could have easily been changed to use `uint_least8_t`. Of course if conforming applications are already using `uint8_t` though, you can't really remove it... – R.. GitHub STOP HELPING ICE Jul 09 '11 at 03:38
  • @Steve: If you'd like some rep, post that as an answer. It's the first that actually answers my question and I think you nailed it. – R.. GitHub STOP HELPING ICE Jul 09 '11 at 22:42
  • @R..: I would, but the OpenGroup website is an absolute shocker, and I've failed to confirm that any pre-C99 version of Posix required `uint8_t`. – Steve Jessop Jul 10 '11 at 11:59
  • Indeed, it seems this question remains open... – R.. GitHub STOP HELPING ICE Feb 15 '12 at 19:26
  • The rationale section of the POSIX spec for `stdint.h` explicitly says that `CHAR_BIT == 8` is a consequence of adding `int8_t`: http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/stdint.h.html I don't know when that verbiage was added to the POSIX document. – Michael Burr May 21 '15 at 21:00

3 Answers


Because the vast majority of standards (related to communication) out of ANSI and ISO talk in terms of octets (8-bit values). There is none of that wishy-washy variable-sized character nonsense :-)

And, since a rather large quantity of C code used char or unsigned char for storing and/or manipulating these values, and assumed they were 8 bits wide, the fact that ISO C allowed char to be wider than 8 bits would cause problems for that code.

Remember one of the overriding goals of ISO C: existing code is important, existing implementations are not. This is one reason why limits.h exists in the first place rather than just assuming specific values: there was code around that assumed otherwise.

POSIX also followed that same guideline. By mandating a byte size of 8 bits, they prevented the breakage of a huge amount of code already in the real world.

paxdiablo
  • In the middle para: "ISO" is a bit confusing, since ISO is famously the author of C99 (which does allow `CHAR_BIT != 8`), and slightly less famously ratifies Posix standards (which don't). So either it would cause problems if it did, or it does cause problems since it does, depending which standard you're talking about. – Steve Jessop Jul 08 '11 at 23:27
  • Sorry, Steve, I was just saying that POSIX has guidelines to follow, just like ISO. I'll try to clarify. – paxdiablo Jul 08 '11 at 23:29

Because char is the smallest addressable unit in C, if you made char larger than 8 bits, it would be difficult or impossible to write a sockets implementation, as you said. Networks all run on CHAR_BIT == 8 machines.

So, if you were to send a message from a machine where CHAR_BIT == 9 to a machine where CHAR_BIT == 8, what is the sockets library to do with the extra bit? There's no reasonable answer to that question. If you truncate the bit, then it becomes hard to specify even something as simple as a buffer to the client of the sockets code: "It's a char array but you can only use the first 8 bits" would be unreasonable on such a system.

Moreover, going from 8-bit systems to 9-bit would be the same problem: what's the sockets system to do with that extra bit? If it sets that bit to zero, imagine what happens to someone who puts an int on the wire. You'd have to do all kinds of nasty bitmasking on the 9-bit machine to make it work correctly.

Finally, since 99.9% of machines use 8 bit characters, it's not all that great a limitation. Most machines that use CHAR_BIT != 8 don't have virtual memory either, which would exclude them from POSIX compatibility anyway.

When you're running on a single machine (as standard C assumes), you can do things like be CHAR_BIT-agnostic, because both sides of whatever might be reading or writing data agree on what's going on. When you introduce something like sockets, where more than one machine is involved, they MUST agree on things like character size and endianness. (Endianness is pretty much just standardized to Big Endian on the wire, though, as many more architectures differ on endianness than they do on byte size.)

Billy ONeal
  • Does POSIX require virtual memory? I remember reading the rational for `posix_spawn` which states "processes are too useful to simply option out of POSIX whenever it must run without address translation or other MMU services." – Dietrich Epp Jul 08 '11 at 23:28
  • Yes POSIX requires memory protection, shared memory-mapped files, etc. The text about `posix_spawn` is in regard to implementors wanting to implement a strict subset of POSIX that's not conformant in itself. – R.. GitHub STOP HELPING ICE Jul 08 '11 at 23:29
  • @Dietrich: Things like `fork` cannot be implemented without some form of virtual memory. There is a subset of POSIX that is implementable on non-MMU machines, but a fully conforming implementation needs one. – Billy ONeal Jul 08 '11 at 23:30
  • @Billy: What your answer still doesn't tell me is why the advent of C99 is what made POSIX decide it was necessary to require CHAR_BIT to be 8. Prior to that, POSIX had networking, and as far as I know CHAR_BIT was allowed to take other values. – R.. GitHub STOP HELPING ICE Jul 08 '11 at 23:30
  • @R.: I was not aware of that. As far as I knew, it implicitly required `CHAR_BIT == 8` beforehand; only after C99 was published was the requirement made explicit. Sort of like how [C++ does not require `basic_string` to use a contiguous buffer but it's pretty much required by its interface](http://stackoverflow.com/questions/2256160/how-bad-is-code-using-stdbasic-stringt-as-a-contiguous-buffer). – Billy ONeal Jul 08 '11 at 23:32
  • See http://pubs.opengroup.org/onlinepubs/9699919799/basedefs/limits.h.html - the requirement was added in Issue 6 (SUSv3), which was the first version aligned with C99. I can't find the text in the rationale at the moment... – R.. GitHub STOP HELPING ICE Jul 08 '11 at 23:35
  • Personally, if I was going to define a CHAR_BIT-agnostic sockets API, I would define all the network functions to take `char*` buffers, but to only read or write the low 8 bits of those buffers as network octets. You'd then also have to sort out addresses and port numbers - port 257 still has to be represented on the wire as two octets, 0x0101, so hton/ntoh would be defined to do more than just alter byte order, they'd also insert/remove padding bits. It's inefficient communicating between two 16-bit-char machines, using twice as much memory as necessary, but better than no comms at all... – Steve Jessop Jul 08 '11 at 23:35
  • I don't object to Posix's decision not to do that, but I also don't see why it's an insurmountable difficulty. "You'd have to do all kinds of nasty bitmasking on the 9 bit machine to make it work correctly." - you'd have to read a 4-octet integer as `((unsigned long)buf[0] << 24) | (buf[1] << 16) | (buf[2] << 8) | (buf[0])`, which is the slow-but-portable way we already know about. Hardware that wanted to support efficient networking would have to find a fast way to do that, perhaps by providing an instruction specifically designed for use by `ntohl`. – Steve Jessop Jul 08 '11 at 23:40
  • May I point out that the 9-bit model [isn't a thought experiment](http://en.wikipedia.org/wiki/TOPS-20)? Then again, it also brought us FTP's `type L 8` as part of the bargain. – geekosaur Jul 09 '11 at 01:29
  • @Steve: Do you really expect most code that runs on modern machines to have to always do that? Seems a bit extreme to me. – Billy ONeal Jul 09 '11 at 01:37
  • @GeekOSaur: There used to be a lot of machines that used 9 bit. I'm unaware, however, of any modern machine that does so. TOPS-20 went by the wayside a long long time ago. – Billy ONeal Jul 09 '11 at 01:38
  • @Billy: Your claim is that a 9-bit character isn't reasonably compatible with sockets. TENEX proved otherwise. No, it's not used now, but that doesn't change that it proved that `char` doesn't have to be isomorphic to *octet* for sockets to work. – geekosaur Jul 09 '11 at 01:44
  • @geekosaur: I don't believe TENEX used BSD sockets (as POSIX does). I could be wrong though. – Billy ONeal Jul 09 '11 at 01:49
  • @Billy: No, I expect *most* modern code that runs on modern machines to have an 8 bit byte, in which case the interface I described is identical to the current Posix interface. The small minority of machines with a different-sized byte would still be able to play, if they wanted. I'm not sure that any do want at the moment, though. If any do, no doubt they have defined their own non-Posix-conformant variant of the BSD sockets interface, and we could see what they've done. – Steve Jessop Jul 09 '11 at 08:55
  • @SteveJessop in fact that's exactly what the remaining 9-bit byte ("banker's byte" I like to call it, since it only persists in some financial institutions) machines do: their networking APIs (and even storage/file APIs, if I remember right) just say that the 9th bit is ignored when writing bytes (`char`s) out and is set to zero when reading bytes in. If anyone is interested, IIRC, the relevant machines were called "ClearPath Dorado" (AIUI just octet byte Intel chips with a thin 9-bit byte CPU emulation), and a few years back you could find official PDFs online of their C APIs/etc. – mtraceur Jul 01 '23 at 05:51

My guesses:

  • Lots of code goes through bits like

    for (int i = 0; i < 8; i++) { ... }
    

    and all that would break.

  • Most other languages assume it's 8 bits anyway, and they would completely break if it were otherwise.

  • Even if most languages didn't require this, most ABIs would still break

  • It's handy in hexadecimal (two nibbles): 0xAA

  • If you start going that route, then you could start thinking: Well, who says we have to use 2-state bits? Why not have tristate bits? etc... it just starts getting less and less practical.

user541686
  • One could make the argument that such languages/code are simply wrong (as the C standard, and by extension the C++ standard, does). It's easy to fix the first loop by replacing `8` with `CHAR_BIT`. However, sound reasoning, so +1. – Billy ONeal Jul 08 '11 at 23:28
  • @Billy, I suspect most of that code was written before ANSI/ISO started their C work. – paxdiablo Jul 08 '11 at 23:34
  • @paxdiablo: well then how come Posix agreed with Mehrdad's argument "it would break lots of code", but ANSI/ISO didn't (and allowed `CHAR_BIT !=8` in C)? Are you in effect saying that IEEE is more sensitive to this code than ANSI/ISO are? In which case it's basically just the luck of whoever was on the relevant committees at the time. – Steve Jessop Jul 08 '11 at 23:49
  • Supposition on my part, @Steve, but ISO was defining a language that could be used everywhere, from the smallest microcontroller to the big honkin' zEnterprise machines. That's also why they didn't mandate contiguity (?) of the alpha characters, since EBCDIC was out there. POSIX had more leeway, they were defining a more limiting environment over and above C and, if it didn't run on an RCA1802 or 8051 microcontroller, that's tough (those babies wouldn't have enough grunt to do _most_ of POSIX anyway). – paxdiablo Jul 08 '11 at 23:54
  • In other words, neither of them wanted to break code, it's just that the code sets they didn't want to break were not necessarily the same. Again, this is _opinion_ on my part, I didn't sit on any of the working groups. – paxdiablo Jul 08 '11 at 23:55
  • @pax: fair enough, more than luck, then. A prognostication on the part of the Posix committee that there will be no future thrust to put 16-bit-char hardware into the OS game, whereas C wants to run on DSPs and stuff. The prognostication is almost self-fulfilling, I suppose, since now that Posix has made that requirement, anyone who wants to build a CPU plus a Posix OS for it is just going to have to make it 8-bit addressable. The 8-16 switch would be astonishingly painful whatever, and won't even start unless Posix changes the restriction or becomes obsolete :-) – Steve Jessop Jul 08 '11 at 23:56
  • @SteveJessop fwiw I hear the Crays had 64-bit addressable hardware but faked 8-bit addressability in their C implementation. So a future CPU *could* do something other than octet bytes without precluding a POSIX implementation... and since LLVM supports integers of arbitrary bit sizes, it's even feasible that the implementation wouldn't be prohibitively difficult/effortful/complicated. – mtraceur Jul 01 '23 at 05:53