98

All the time I read sentences like:

"don't rely on 1 byte being 8 bits in size"

"use CHAR_BIT instead of 8 as a constant to convert between bits and bytes"

et cetera. What real-life systems are there today where this holds true? (I'm not sure if there are differences between C and C++ regarding this, or if it's actually language-agnostic. Please retag if necessary.)
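For concreteness, the kind of code this advice is aimed at might look like the sketch below (a made-up helper, not from any particular codebase):

```cpp
#include <climits>   // CHAR_BIT
#include <cstddef>   // std::size_t

// Hypothetical helper: width in bits of an object of type T.
// The point of the advice is to write CHAR_BIT here instead of hard-coding 8.
template <typename T>
constexpr std::size_t bits_in() {
    return sizeof(T) * CHAR_BIT;   // not: sizeof(T) * 8
}

static_assert(bits_in<char>() == CHAR_BIT, "a char is exactly one byte wide");
static_assert(bits_in<int>() >= 16, "int is at least 16 bits on any conforming implementation");
```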

Xeo
  • 9
    If you go around assuming all the world is on Intel, you'll be right 90% of the time. For now. But don't you want your code to work everywhere, and continue to work everywhere? – Paul Tomblin Apr 01 '11 at 16:18
  • 4
    The only current CPUs I'm aware of where CHAR_BIT may be other than 8 are DSPs which in many cases do not have byte addressable memory, so CHAR_BIT tends to be equal to the word size (e.g. 24 bits). Historically there were mainframes with 9 bit bytes (and 36 bit words) but I can't imagine there are too many of these still in use. – Paul R Apr 01 '11 at 16:20
  • 1
    While you find that in the c or c++ standards, it actually has more to do with the architecture of the underlying chips than the programming language. As long as you're working on the [desk|lap|palm]top you're unlikely to run into an exception soon, but delve into the embedded world and get ready for a ride. – dmckee --- ex-moderator kitten Apr 01 '11 at 16:21
  • 1
    +1. Good question. Even I want to know that. I think it should be tagged with C++ and C (so I added these tags). After all, it mentions `CHAR_BIT`. – Nawaz Jun 19 '11 at 07:36
  • 3
    IIRC, a byte was originally defined as the space needed for one character. Now, in C and C++, a char is by definition a byte - but there are also multi-byte character representations. Personally, my view is that the ancient history is precisely that - "byte" has meant "8 bits" for decades, and anyone who specifies that their platform has non-8-bit "bytes" (if any such platform exists these days) should really be using a different word. –  Jun 19 '11 at 07:45
  • @Steve314: The original byte was 6 bits. I can't imagine anybody wanting to use that concept of a byte. A better (but still not quite right) definition of a "byte" is that it is the smallest addressable unit of data. There are plenty of valid reasons some computer may have a byte size other than 8 bits. As dmckee mentioned, just delve into the embedded world. And do buckle up. The ride is a bit wild. – David Hammen Jun 19 '11 at 15:10
  • @Steve314: The fact that `char` is by definition a "byte" doesn't really make much difference. These are just two internal terms in C/C++ terminology, which can be used interchangeably. The "byte" from C/C++ terminology has no formal relation to the machine "byte". – AnT stands with Russia Jun 19 '11 at 15:59
  • 3
    @Paul Tomblin: Your reasoning is backwards. The allowance that `CHAR_BIT` be something other than 8 is not planning for the future; it's dragging along ancient history. POSIX finally shut the door on that chapter by mandating `CHAR_BIT==8`, and I don't think we'll *ever* see a new general-purpose (non-DSP) machine with non-8-bit bytes. Even on DSPs, making `char` 16- or 32-bit is just laziness by the compiler implementors. They could still make `char` 8-bit if they wanted, and hopefully C2020 or so will force them to... – R.. GitHub STOP HELPING ICE Sep 08 '11 at 02:27
  • 19
    @Steve314 "_a byte was originally defined as the space needed for one character._" A byte was and still is defined as the smallest addressable unit. "_"byte" has meant "8 bits" for decades_" No, a byte has meant smallest addressable unit for decades. "Octet" has meant "8 bits" for decades. – curiousguy Dec 23 '11 at 04:32
  • 2
    @R.. "_making char 16- or 32-bit is just laziness by the compiler implementors_" Nonsense. Making char the size of the smallest **natively** addressable unit is sticking to the hardware, which is exactly what people expect. **C and C++ are not Java.** – curiousguy Dec 23 '11 at 04:33
  • 1
    @AndreyT "_The "byte" from C/C++ terminology has no formal relation to the machine "byte"._" Nonsense. A C/C++ byte is expected to exactly match the machine's byte, unless the machine's byte has less that 8 bits. – curiousguy Dec 23 '11 at 04:35
  • 3
    @curiousguy - the term "byte" is usually credited to Werner Buchholz at IBM in 1956, who described it as meaning "a group of bits used to encode a character, or the number of bits transmitted in parallel to and from input-output units". It originally had nothing to do with addressable units - that meaning was adopted in various contexts (e.g. the C standard) despite "word" being defined as the "smallest addressable unit". Language evolves, and virtually everyone using the word "byte" in recent decades has meant 8 bits. If you don't believe me, check a few dictionaries. –  Dec 23 '11 at 08:33
  • 5
    @curiousguy: These days computers actually talk to one another. Having a byte that's anything other than an octet does nothing but severely break this important property. Same goes for using other backwards things like EBCDIC. – R.. GitHub STOP HELPING ICE Dec 23 '11 at 18:53
  • @R.. "_These days computers actually talk to one another. Having a byte that's anything other than an octet does nothing but severely break this important property_" So you are not able to explain which problems it might cause. – curiousguy Dec 24 '11 at 00:06
  • 2
    @curiousguy: Information interchange is all octet-based. If a byte is larger than 8 bits, then the implementation's `fputc` (and other output functions) must write more than a single octet to the communications channel (file/socket/etc.). This means `fputc('A', f)` will write at least two octets, leaving extra junk (likely one or more null octets) for the recipient to read. Conversely, `fgetc` would read multiple octets from the communication channel as a single `char` for the host, requiring further ugly processing to break it apart to use it. This is a brief summary in limited comment space. – R.. GitHub STOP HELPING ICE Dec 24 '11 at 00:30
  • @R.. "_must write more than a single octet to the communications channel_" not if they are expected to be useful. – curiousguy Dec 24 '11 at 03:01
  • 3
    To be conformant, it **must**. Round trip `fgetc` and `fputc` must be value-preserving. Not to mention, if it didn't, saving/loading binary data would not work. – R.. GitHub STOP HELPING ICE Dec 24 '11 at 05:06
  • 2
    @curiousguy: Absolutely incorrect. For example, hardware platforms that have 32-bit minimal addressable unit don't refer to these units as "bytes". C/C++ implementations on such platform will normally use 32-bit `char`s, which will represent "bytes" in C/C++sense of the term. – AnT stands with Russia Dec 24 '11 at 20:17
  • @AndreyT "_For example, hardware platforms that have 32-bit minimal addressable unit don't refer to these units as "bytes"._" so they call it...? – curiousguy Dec 25 '11 at 15:02
  • 4
    @curiousguy: Words. They call it words. Four-byte words, to be precise. The entire "minimal addressable unit (MAU)" is also used from time to time by those who don't want to feel like they are tying the notion of "word" to the addressing properties of the hardware platform. – AnT stands with Russia Dec 25 '11 at 19:41
  • **AndreyT**: the C++ spec, §1.8/1, says "Every byte has a unique address." **@curiousguy**: nothing in the spec says that address has to be a _native_ address, so I see nothing prohibiting a compiler from emulating 8-bit bytes on systems with a 16+ bit MAU. **R**: C++ doesn't specify that, and I/O can work with any number of bits; it just requires translation, similar to endian issues. I've done it. – Mooing Duck Feb 21 '12 at 21:17
  • 2
    @Mooing Bit-width translation can be done ad-hoc for specific systems, but I don't think the C/C++ standards specify enough to guarantee portable translation in general. When `CHAR_BIT` values differ, suddenly we also have to worry about bit-ordering. The order of bytes in an I/O stream is well-defined, but the order of bits in an I/O stream is not specified AFAIK. If one machine outputs 60 bits to an I/O stream and the other reads 8, which 8 are they? What happens to the 4 left over (60-7*8)? I just give up and require `CHAR_BIT = 8` for cross-machine I/O, which works most of the time. – jw013 Jan 04 '13 at 22:12
  • I found an IBM reference [COMPUTER USAGE COMMUNIQUÉ Vol. 2 No. 3](http://archive.computerhistory.org/resources/text/Computer_Usage_Company/cuc.communique_vol2no3.1963.102651922.pdf) from 1963 that uses the term "byte" to refer to a variable size of 1 to 8 bits. Until now I had always thought that 1 byte = 8 bits, although character sizes and word sizes could be different. – Mark Ransom Apr 07 '16 at 23:14

9 Answers

80

On older machines, codes smaller than 8 bits were fairly common, but most of those have been dead and gone for years now.

C and C++ have mandated a minimum of 8 bits for char, at least as far back as the C89 standard. [Edit: For example, C90, §5.2.4.2.1 requires CHAR_BIT >= 8 and UCHAR_MAX >= 255. C89 uses a different section number (I believe that would be §2.2.4.2.1) but identical content]. They treat "char" and "byte" as essentially synonymous [Edit: for example, CHAR_BIT is described as: "number of bits for the smallest object that is not a bitfield (byte)".]

There are, however, current machines (mostly DSPs) where the smallest type is larger than 8 bits -- a minimum of 12, 14, or even 16 bits is fairly common. Windows CE does roughly the same: its smallest type (at least with Microsoft's compiler) is 16 bits. They do not, however, treat a char as 16 bits -- instead they take the (non-conforming) approach of simply not supporting a type named char at all.
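To make those guarantees concrete, here is a minimal sketch (mine, not taken from the standards) that should compile and run on any conforming C++ implementation, whether CHAR_BIT is 8 on a desktop or 16/24 on a DSP:

```cpp
#include <climits>
#include <cstdio>

// Both checks follow from the requirements cited above (CHAR_BIT >= 8,
// UCHAR_MAX >= 255) and must pass on every conforming implementation.
static_assert(CHAR_BIT >= 8, "at least 8 bits per byte is guaranteed");
static_assert(UCHAR_MAX >= 255, "unsigned char can hold at least 0..255");

int main() {
    // sizeof is measured in bytes (i.e. in chars), so sizeof(char) is 1 by
    // definition even when a byte is wider than 8 bits.
    std::printf("CHAR_BIT = %d, sizeof(int) = %zu bytes = %zu bits\n",
                CHAR_BIT, sizeof(int), sizeof(int) * CHAR_BIT);
    return 0;
}
```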

Jerry Coffin
  • I'll accept this answer because it puts everything important into one place. Maybe also add the bit from larsmans' comment that `CHAR_BIT` is self-documenting, which is what made me use it now. I like self-documenting code. :) Thanks everyone for their answers. – Xeo Apr 03 '11 at 11:22
  • Could you please quote where in the C89 Standard it says `char` must be a *minimum* of `8` bits? – Nawaz Jun 19 '11 at 07:40
  • 1
    @Nawaz: I don't have C89 handy, but C99 section 5.2.4.2.1 says, regarding the values in that section, that "implementation-defined values shall be equal or greater in magnitude (absolute value) to those shown, with the same sign" -- and then gives CHAR_BIT as 8. In other words, larger values are compliant, smaller ones are not. – David Hammen Jun 19 '11 at 15:16
  • @David: Hmm, I had read that, but didn't interpret it that way. Thanks for making me understand that. – Nawaz Jun 19 '11 at 15:23
  • 30
    Wow +1 for teaching me something new about how broken WinCE is... – R.. GitHub STOP HELPING ICE Sep 08 '11 at 02:18
  • there is something seriously wrong with Windows CE – A. H. Oct 25 '13 at 19:55
  • 2
    @Jerry, you sure about `char` and WinCE? I wrote a bit for WinCE 5.0 /x86 and /ARM; there was nothing wrong with `char` type. What they did is remove char-sized versions of **Win32 API** (so GetWindowTextW is there but GetWindowTextA is not etc.) – atzz Mar 17 '14 at 12:47
  • 2
    @atzz: Availability (or lack of it) of `char` obviously depends on the compiler, not the OS itself. I (at least think I) remember one of the early compilers for CE lacking `char`, but it's been quite a while since I wrote any code for CE, so I can't really comment on anything current (or close to it). – Jerry Coffin Mar 18 '14 at 05:09
24

TODAY, in the world of C++ on x86 processors, it is pretty safe to rely on one byte being 8 bits. Processors where the word size is not a power of 2 (8, 16, 32, 64) are very uncommon.

IT WAS NOT ALWAYS SO.

The Control Data 6600 (and its brothers) Central Processor used a 60-bit word, and could only address a word at a time. In one sense, a "byte" on a CDC 6600 was 60 bits.

The DEC-10 byte pointer hardware worked with arbitrary-size bytes. The byte pointer included the byte size in bits. I don't remember whether bytes could span word boundaries; I think they couldn't, which meant that you'd have a few wasted bits per word if the byte size was not a divisor of 36 (e.g. 6, 9, 12, or 18 bits). (The DEC-10 used a 36-bit word.)

John R. Strohm
  • 2
    Strings on the CDC were normally stored 10 six-bit characters to the word, though, so it's much more reasonable to treat it as having a 6-bit byte (with strings normally allocated in 10-byte chunks). Of course, from the viewpoint of C or C++, a 6-bit byte isn't allowed, so you'd have had to double them up and use a 12-bit word as the "byte" (which would still work reasonably well -- the PPUs were 12-bit processors, and communication between the CPU and PPUs was done in 12-bit chunks). – Jerry Coffin Apr 01 '11 at 16:29
  • 3
    When I was doing 6600, during my undergrad days, characters were still only 6 bits. PASCAL programmers had to be aware of the 12-bit PP word size, though, because end-of-line only occurred at 12-bit boundaries. This meant that there might or might not be a blank after the last non-blank character in the line, and I'm getting a headache just thinking about it, over 30 years later. – John R. Strohm Apr 01 '11 at 17:36
  • Holy cow what a blast from the past! +1 for the memories! – Scott C Wilson Jun 19 '11 at 15:04
  • 7
    "TODAY, in the world of C++ on x86 processors" - You might want to talk to TI, Analog Devices (which have 16 bit DSPs), Freescale/NXP (24 bit DSPs), ARM, MIPS (both not x86), etc. In fact x86 is a minority of architectures and devices sold. But yes, a **binary** digital computer hardly has **trinary**(/etc.) digits. – too honest for this site Oct 10 '17 at 00:46
15

Unless you're writing code that could be useful on a DSP, you're completely entitled to assume bytes are 8 bits. All the world may not be a VAX (or an Intel), but all the world has to communicate, share data, establish common protocols, and so on. We live in the internet age built on protocols built on octets, and any C implementation where bytes are not octets is going to have a really hard time using those protocols.

It's also worth noting that both POSIX and Windows have (and mandate) 8-bit bytes. That covers 100% of interesting non-embedded machines, and these days a large portion of non-DSP embedded systems as well.
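If you take that position, one way to make the assumption explicit (just a sketch, not something mandated by POSIX or Windows) is to refuse to build at all where a byte is not an octet:

```cpp
#include <climits>
#include <cstdint>

// Fail loudly on exotic platforms instead of silently miscomputing sizes
// for octet-based wire formats.
#if CHAR_BIT != 8
#error "This code assumes 8-bit bytes (octet-oriented I/O)."
#endif

// Where bytes are octets, the exact-width type is available in practice,
// and a buffer of N bytes is also a buffer of N octets.
static_assert(sizeof(std::uint8_t) == 1, "uint8_t is exactly one byte here");
```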

R.. GitHub STOP HELPING ICE
  • What real problems have you seen when a machine where a byte is not an octet communicates on the Internet? – curiousguy Dec 23 '11 at 04:38
  • I don't think that networking will be much of a problem. The very low level calls should be taking care of the encoding details. Any normal program should not be affected. – Wiz Jan 04 '12 at 15:47
  • 6
    They can't. `getc` and `putc` have to preserve `unsigned char` values round-trip, which means you can't just have "extra bits" in `char` that don't get read/written. – R.. GitHub STOP HELPING ICE Jan 04 '12 at 19:42
  • They probably use C99's `uint8_t`. Since POSIX requires it for all socket functions it should be available (although C99 does not require `uint8_t` to be defined). – Andreas Spindler Nov 06 '12 at 16:20
  • 14
    `uint8_t` **cannot** exist if `char` is larger than 8 bits, because then `uint8_t` would have padding bits, which are not allowed. – R.. GitHub STOP HELPING ICE Jan 05 '13 at 02:05
  • 1
    @R..: §7.20.1.1.2 (C11) says explicitly that there are no padding bits in `uintN_t`. §7.20.1.1.3 says *"these types are optional."* §3.6 defines `byte` as: *"addressable unit of data storage large enough to hold any member of the basic character set of the execution environment"* (I don't see the word "smallest" in the definition). There is a notion of internal vs. trailing padding. Can `uint8_t` have trailing padding? Is there a requirement that a `uint8_t` object be at least `CHAR_BIT` bits wide? (as there is for the `_Bool` type). – jfs Mar 29 '16 at 17:41
  • 1
    @J.F.Sebastian: I have no idea where your notion of "trailing padding" came from or what it would mean. Per Representation of Types all objects have a *representation* which is an overlaid array `unsigned char[sizeof(T)]` which may consist partly of padding. – R.. GitHub STOP HELPING ICE Mar 29 '16 at 21:17
  • @R.. Have you tried to look up the phrase "trailing padding" in the current C standard? (I did it using n1570 draft) – jfs Mar 29 '16 at 21:37
  • 1
    @J.F.Sebastian: Yes. It's only used in relation to structures; it means padding after any of the members that exists as part of bringing the total struct size up to the total size it needs to be (usually just enough to increase the size to a multiple of its alignment). It has nothing to do with non-aggregate types. – R.. GitHub STOP HELPING ICE Mar 29 '16 at 23:12
  • 2
    @R.. One thing I don't get about your "they can't [communicate on the internet]" comment is that you reference `getc` and `putc`, but are those strongly relevant to the question of accessing the internet? Doesn't almost everything in the world access the internet through interfaces outside of the standard C library? Last I checked, you couldn't even get a `stdio.h`-compatible object pointing to a network connection without first going through system-specific interfaces, could you? So is there any reason why details of `getc`/etc. would preclude access to the internet? – mtraceur Mar 27 '18 at 00:30
  • @R.. I also see no conceptual problem with a hypothetical system avoiding dropping higher bits from a `char`, so long as at the lowest level interface they either require data reads/writes in multiples of 8-bits at a time (2 12-bit bytes would give you a clean 3-octet boundary, 8 9-bit bytes would give you a clean 9-octet boundary, etc) or buffered accordingly. I suspect such systems do not exist due to lack of demand, but is there any reason why it'd be an impossibility? – mtraceur Mar 27 '18 at 00:45
  • Lest my comments give the opposite impression, though, +1 to this answer, because it explains *why* it's reasonable to expect an 8-bit byte in all of the circumstances *when it is*, which really helps answering the fundamental question of "when should I worry about bytes/`char` being anything other than an octet" in a way that mere examples can't. – mtraceur Mar 27 '18 at 00:56
7

From Wikipedia:

The size of a byte was at first selected to be a multiple of existing teletypewriter codes, particularly the 6-bit codes used by the U.S. Army (Fieldata) and Navy. In 1963, to end the use of incompatible teleprinter codes by different branches of the U.S. government, ASCII, a 7-bit code, was adopted as a Federal Information Processing Standard, making 6-bit bytes commercially obsolete. In the early 1960s, AT&T introduced digital telephony first on long-distance trunk lines. These used the 8-bit µ-law encoding. This large investment promised to reduce transmission costs for 8-bit data. The use of 8-bit codes for digital telephony also caused 8-bit data "octets" to be adopted as the basic data unit of the early Internet.

Daniel A. White
6

As an average programmer on mainstream platforms, you do not need to worry too much about one byte not being 8 bits. However, I'd still use the CHAR_BIT constant in my code and assert (or better, static_assert) at any locations where you rely on 8-bit bytes. That should put you on the safe side.

(I am not aware of any relevant platform where it doesn't hold true).
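For illustration, such a guard might look like this (an illustrative sketch using the C++11 form mentioned above, with a hypothetical function name):

```cpp
#include <climits>
#include <cassert>

// Compile-time form: a port to a platform with wider bytes fails to build
// instead of misbehaving quietly.
static_assert(CHAR_BIT == 8, "this code assumes 8-bit bytes");

void parse_octet_header(const unsigned char* data) {
    // Runtime form, for toolchains that predate static_assert:
    assert(CHAR_BIT == 8);
    // ... octet-based parsing would go here ...
    (void)data;   // hypothetical parameter, unused in this sketch
}
```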

Alexander Gessler
  • 2
    Besides being safe, `CHAR_BIT` is self-documenting. And I learned on SO that some embedded platforms apparently have 16-bit `char`. – Fred Foo Apr 01 '11 at 16:25
  • I realize that CHAR_BIT is meant to represent the byte size, but the beef I have with that term is that it really has less to do with chars and more to do with byte length. A newbie dev will likely read CHAR_BIT and think it has something to do with using UTF8 or something like that. It's an unfortunate piece of legacy IMO. – N8allan Jun 20 '14 at 18:43
4

Firstly, the number of bits in char does not formally depend on the "system" or the "machine", even though this dependency is usually implied by common sense. The number of bits in char depends only on the implementation (i.e. on the compiler). There's no problem implementing a compiler that will have more than 8 bits in char for any "ordinary" system or machine.

Secondly, there are several embedded platforms where sizeof(char) == sizeof(short) == sizeof(int), each having 16 bits (I don't remember the exact names of these platforms). Also, the well-known Cray machines had similar properties, with all these types having 32 bits in them.
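As a small illustrative sketch (hypothetical output values), the widths in question are properties of the implementation that can simply be queried:

```cpp
#include <climits>
#include <cstdio>

int main() {
    // On the platforms described above both comparisons hold (e.g. 16-bit
    // char/short/int on some DSPs, 32-bit everything on the old Crays);
    // on a typical desktop implementation both print 0.
    std::printf("sizeof(char) == sizeof(short): %d\n",
                static_cast<int>(sizeof(char) == sizeof(short)));
    std::printf("sizeof(short) == sizeof(int): %d\n",
                static_cast<int>(sizeof(short) == sizeof(int)));
    std::printf("bits in a char: %d\n", CHAR_BIT);
    return 0;
}
```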

AnT stands with Russia
  • While you can technically do anything you want when implementing a compiler, in a practical sense you need to conform to the operating system's ABI, and this generally forces all compilers for a particular system to use the same data representations. – Barmar Mar 07 '14 at 07:34
  • @Barmar: The need to conform to the operating system's ABI applies to interface data formats only. It does not impose any limitations on the internal data formats of the implementation. The conformance can be (and typically is) achieved by using properly selected (and possibly non-standard) types to describe the interface. For example, the boolean type of the Windows API (hiding behind `BOOL`) is different from `bool` of C++ or C. That does not create any problems for implementations. – AnT stands with Russia Mar 07 '14 at 07:58
  • Many APIs and ABIs are specified in terms of standard C data types, rather than abstract types. POSIX has some abstract types (e.g. `size_t`), but makes pretty liberal use of `char` and `int` as well. The ABI for particular POSIX implementations must then specify how these are represented so that interfaces will be compatible across implementations (you aren't required to compile applications with the same implementation as the OS). – Barmar Mar 07 '14 at 08:05
  • @Barmar: That is purely superficial. It is not possible to specify ABIs in terms of truly *standard* language-level types. Standard types are flexible by definition, while ABI interface types are frozen. If some ABI uses standard type names in its specification, it implies (and usually explicitly states) that these types are required to have some specific frozen representation. Writing header files in terms of standard types for such ABIs will only work for those specific implementation that adhere to the required data format. – AnT stands with Russia Mar 07 '14 at 17:09
  • Note that for the actual implementation, "ABI in terms of standard types" will simply mean that some header files are written in terms of standard types. However, this does not in any way preclude the implementation from changing the representation of standard types. The implementation just has to remember that those header files have to be rewritten in terms of some other types (standard or not) to preserve binary compatibility. – AnT stands with Russia Mar 07 '14 at 17:12
  • For example, today I specify some ABI in terms of type `int` and presume (and explicitly state) that in this ABI `int` has 32 bits. Tomorrow my compiler gets significantly upgraded and its `int` changes from 32 bits to 64 bits. To preserve binary compatibility all I have to do in this case is replace `int` with `int32_t` in that ABI's header files. I don't even have to change the documentation of the ABI, since it explicitly states that it expects 32-bit `int`. – AnT stands with Russia Mar 07 '14 at 17:16
  • @AndreyT: not in C++, you can't ... – SamB Nov 12 '14 at 01:41
  • @SamB: Huh? I "can't" what exactly? – AnT stands with Russia Nov 12 '14 at 02:06
2

I do a lot of embedded work and am currently working on DSP code with CHAR_BIT equal to 16.

dubnde
2

Historically, there have existed a bunch of odd architectures whose native word sizes were not multiples of 8. If you ever come across any of these today, let me know.

  • The first commercial CPU by Intel was the Intel 4004 (4-bit)
  • PDP-8 (12-bit)

The size of the byte has historically been hardware-dependent, and no definitive standards exist that mandate the size.

It might just be a good thing to keep in mind if you're doing lots of embedded stuff.

John Leidegren
1

Adding one more as a reference, from the Wikipedia entry on the HP Saturn:

The Saturn architecture is nibble-based; that is, the core unit of data is 4 bits, which can hold one binary-coded decimal (BCD) digit.

auselen