Making a program portable between machines that have different number of bits in a "machine byte"

Question

We are all fans of portable C/C++ programs.

We know that sizeof(char) or sizeof(unsigned char) is always 1 "byte". But that 1 "byte" doesn't mean a byte with 8 bits. It just means a "machine byte", and the number of bits in it can differ from machine to machine. See this question.
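To make the premise concrete, here is a minimal C sketch of how a program can ask its own implementation how wide a "machine byte" is. CHAR_BIT comes from the standard header <limits.h>; the helper names bits_per_char and report_byte_width are made up for this illustration.

```c
#include <limits.h>
#include <stdio.h>

/* Number of bits in one C "byte" (one char) on this implementation.
   The C standard only guarantees that this is at least 8. */
int bits_per_char(void) {
    return CHAR_BIT;
}

/* Writes a one-line description of the byte width to `out`.
   (Hypothetical helper, for illustration only.) */
void report_byte_width(FILE *out) {
    fprintf(out, "CHAR_BIT = %d, sizeof(char) = %zu\n",
            bits_per_char(), sizeof(char));
}
```

Calling report_byte_width(stdout) on a typical desktop system would report 8 bits; on a machine like the one described here it would presumably report 9, while sizeof(char) stays 1 either way.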

Suppose you write out the ASCII letter 'A' into a file foo.txt. On any normal machine these days, which has an 8-bit machine byte, these bits would get written out:

01000001

But if you were to run the same code on a machine with a 9-bit machine byte, I suppose these bits would get written out:

001000001

More to the point, the latter machine could write out these 9 bits as one machine byte:

100000000

But if we were to read this data on the former machine, we wouldn't be able to do it properly, since there isn't enough room. Somehow, we would have to first read one machine byte (8 bits), and then somehow transform the final 1 bit into 8 bits (a machine byte).

How can programmers properly reconcile these things?

The reason I ask is that I have a program that writes and reads files, and I want to make sure that it doesn't break 5, 10, 50 years from now.

In all fairness, it is not worth the effort, and in the event you need to read 9 bit files, you are usually better off writing a translation program to convert them to an 8 bit format. — nos, Jan 18 '13 at 12:26
`I want to make sure that it doesn't break 5, 10, 50 years from now.` I would say that it is the responsibility of the programmer 50 years from now. Anyway, if you want portability, use JSON, XML, or even ASN.1. — SJuan76, Jan 18 '13 at 12:29
@AlexeyFrunze If a future 9 bit machine writes out a byte it would end up being 9 bits on storage. Just as if you need to write 7 bit files with a platform that has 8 bit bytes (as most have), you'll need to do a whole lot of bit fiddling. — nos, Jan 18 '13 at 12:30
Oh, and BTW the issue is not about making the program compatible, but with the file format. — SJuan76, Jan 18 '13 at 12:39
There is no issue. When this "9 bit" machine comes, you'll make sure it talks 8 or 16 or some other standard... By this very same logic, an overhead for 16 bits will be justified when this "16 bit" machine comes anyway. — ActiveTrayPrntrTagDataStrDrvr, Jan 18 '13 at 12:42
C/C++ programmers can barely write proper programs for 8 bit bytes as it is. If no better programming language than C or C++ appears within the next 50 years, I doubt you need to worry, since you have most likely been killed by a software error in a car/machine/plane/nuclear reactor long before then. — Lundin, Jan 18 '13 at 14:01
Have you considered that on the IBM mainframe I use *right now* the code for 'A' is `11000001`. How do we cope with that? — Bo Persson, Jan 18 '13 at 22:44
Just to clarify (you could clarify this in your question): are you actually asking this: how to read from mass storage or network data packets which use a different number of bits per byte than the CPU architecture uses? This is not really something defined in the scope of the C language or standard library, so it's kind of impossible to prepare for an undefined case. — hyde, Jan 22 '13 at 06:51

score 7 · Accepted Answer · edited May 23 '17 at 10:30

How can programmers properly reconcile these things?

By doing nothing. You've presented a filesystem problem.

Imagine that dreadful day when the first of many 9-bit machines is booted up, ready to recompile your code and process that ASCII letter A that you wrote to a file last year.

To ensure that a C/C++ compiler can reasonably exist for this machine, this new computer's OS follows the same standards that C and C++ assume, where files have a size measured in bytes.

...There's already a little problem with your 8-bit source code. There's only about a 1-in-9 chance each source file is a size that can even exist on this system.

Or maybe not. As is often the case for me, Johannes Schaub - litb has pre-emptively cited the standard regarding valid formats for C++ source code.

Physical source file characters are mapped, in an implementation-defined manner, to the basic source character set (introducing new-line characters for end-of-line indicators) if necessary. Trigraph sequences (2.3) are replaced by corresponding single-character internal representations. Any source file character not in the basic source character set (2.2) is replaced by the universal-character-name that designates that character. (An implementation may use any internal encoding, so long as an actual extended character encountered in the source file, and the same extended character expressed in the source file as a universal-character-name (i.e. using the \uXXXX notation), are handled equivalently.)

"In an implementation-defined manner." That's good news...as long as some method exists to convert your source code to any 1:1 format that can be represented on this machine, you can compile it and run your program.

So here's where your real problem lies. If the creators of this computer were kind enough to provide a utility to bit-extend 8-bit ASCII files so they may be actually stored on this new machine, there's already no problem with the ASCII letter A you wrote long ago. And if there is no such utility, then your program already needs maintenance and there's nothing you could have done to prevent it.

Edit: The shorter answer (addressing comments that have since been deleted)

The question asks how to deal with a specific 9-bit computer...

With hardware that has no backwards-compatible 8-bit instructions
With an operating system that doesn't use "8-bit files".
With a C/C++ compiler that breaks how C/C++ programs have historically written text files.

Damian Conway has an often-repeated quote comparing C++ to C:

"C++ tries to guard against Murphy, not Machiavelli."

He was describing other software engineers, not hardware engineers, but the intention is still sound because the reasoning is the same.

Both C and C++ are standardized in a way that requires you to presume that other engineers want to play nice. Your Machiavellian computer is not a threat to your program because it's a threat to C/C++ entirely.

Returning to your question:

How can programmers properly reconcile these things?

You really have two options.

Accept that the computer you describe would not be appropriate in the world of C/C++
Accept that C/C++ would not be appropriate for a program that might run on the computer you describe

hyde · Answer 2 · 2013-01-22T10:57:39.423

Only way to be sure is to store data in text files, numbers as strings of number characters, not some amount of bits. XML using UTF-8 and base 10 should be pretty good overall choice for portability and readability, as it is well defined. If you want to be paranoid, keep the XML simple enough, so that in a pinch it can be easily parsed with simple custom parser, in case a real XML parser is not readily available for your hypothetical computer.

When parsing numbers, and it is bigger than what fits in your numeric data type, well, that's an error situation you need to handle as you see fit in the context. Or use a "big int" library, which can then handle arbitrarily large numbers (with an order of magnitude performance hit compared to "native" numeric data types, of course).
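As a rough illustration of the "numbers as text, handle the overflow case" advice, here is a minimal C sketch built on the standard strtol. The function name parse_decimal and its return convention are assumptions made for this example, not part of any library.

```c
#include <errno.h>
#include <stdlib.h>

/* Parses a base-10 integer from `text` into *out.
   Returns 0 on success, or -1 if the text is not a complete number
   or the value does not fit in a long on this machine.
   (Hypothetical helper; the name and return convention are made up.) */
int parse_decimal(const char *text, long *out) {
    char *end;
    long value;

    errno = 0;
    value = strtol(text, &end, 10);
    if (end == text || *end != '\0') {
        return -1;              /* no digits at all, or trailing junk */
    }
    if (errno == ERANGE) {
        return -1;              /* too large or too small for long */
    }
    *out = value;
    return 0;
}
```

Because the file stores digits rather than a fixed number of bits, the same text stays readable whatever the width of long is; only the range check changes its outcome from machine to machine.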

If you need to store bit fields, then store bit fields, that is number of bits and then bit values in whatever format.

If you have a specific numeric range, then store the range, so you can explicitly check if they fit in available numeric data types.

Byte is pretty fundamental data unit, so you can not really transfer binary data between storages with different amount of bits, you have to convert, and to convert you need to know how the data is formatted, otherwise you simply can not convert multi-byte values correctly.

Adding actual answer:

In your C code, do not handle byte buffers, except in isolated functions which you will then modify as appropriate for the CPU architecture. For example, JPEG handling functions would take either a struct wrapping the image data in an unspecified way, or a file name to read the image from, but never a raw char* to a byte buffer.
Wrap strings in a container which does not assume encoding (presumably it will use UTF-8 or UTF-16 on 8-bit byte machine, possibly currently non-standard UTF-9 or UTF-18 on 9-bit byte machine, etc).
Wrap all reads from external sources (network, disk files, etc) into functions which return native data.
Create code where no integer overflows happen, and do not rely on overflow behavior in any algorithm.
Define all-ones bitmasks using ~0 (instead of 0xFFFFFFFF or something)
Prefer IEEE floating point numbers for most numeric storage, where integer is not required, as those are independent of CPU architecture.
Do not store persistent data in binary files, which you may have to convert. Instead use XML in UTF-8 (which can be converted to UTF-X without breaking anything, for native handling), and store numbers as text in the XML.
Same as with different byte orders, except much more so, only way to be sure is to port your program to actual machine with different number of bits, and run comprehensive tests. If this is really important, then you may have to first implement such a virtual machine, and port C-compiler and needed libraries for it, if you can't find one otherwise. Even careful (=expensive) code review will only take you part of the way.
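The "wrap all reads into functions which return native data" item in the list above can be sketched in C roughly as follows. The names decode_u32_be and encode_u32_be are invented for this example, and the sketch assumes that each array element carries exactly one 8-bit value in its low bits, however wide unsigned char is on the host.

```c
#include <stddef.h>

/* Decodes a 32-bit unsigned value from four octets, most significant
   first. Only the low 8 bits of each element are used, so the result
   does not depend on CHAR_BIT or on the host's byte order.
   (Hypothetical helper names, for illustration only.) */
unsigned long decode_u32_be(const unsigned char *octets) {
    unsigned long value = 0;
    size_t i;

    for (i = 0; i < 4; i++) {
        value = (value << 8) | (octets[i] & 0xFFul);
    }
    return value;
}

/* Inverse of decode_u32_be: stores the low 32 bits of `value`
   as four octets, most significant first. */
void encode_u32_be(unsigned long value, unsigned char *octets) {
    size_t i;

    for (i = 0; i < 4; i++) {
        octets[i] = (unsigned char)((value >> (8 * (3 - i))) & 0xFFul);
    }
}
```

The rest of the program then works only with unsigned long values; if the external format or the machine changes, only these two functions would need to be revisited.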

score 2 · Answer 3 · answered Jan 18 '13 at 12:38

If you're planning to write programs for quantum computers (which will be available for us to buy in the near future), then start learning quantum physics and take a class on programming them.

Unless you're planning for a boolean computer logic in the near future, then.. my question is how will you make it sure that the filesystem available today will not be the same tomorrow? or how a file stored with 8 bit binary will remain portable in the filesystems of tomorrow?

If you want to keep your programs running through generations, my suggestion is create your own computing machine, with your own filesystem and your own operating system, and change the interface as the needs of tomorrow change.

My problem is, the computer system I programmed a few years ago (Motorola 68000) doesn't exist anymore for the general public, and the program relied heavily on the machine's byte order and assembly language. Not portable anymore :-(

Aniket Inge

([Freescale Coldfire](http://en.wikipedia.org/wiki/Coldfire) family is closely related to 68000. Not 100% portable, but thousands of people have already done such porting before you, so there should be plenty of documentation and help to be found.) – Lundin Jan 18 '13 at 13:51
Yeah, well I'll make my own computer, with blackjack and hookers. In fact forget about the computer. – Shahbaz Jan 18 '13 at 14:12
@Shahbaz suit yourself :-) – Aniket Inge Jan 18 '13 at 14:21
@Aniket, it's just that your third paragraph reminded me of that, made me laugh :D – Shahbaz Jan 18 '13 at 14:42

score 2 · Answer 4 · answered Jan 18 '13 at 14:05

If you're talking about writing and reading binary data, don't bother. There is no portability guarantee today, other than that data you write from your program can be read by the same program compiled with the same compiler (including command-line settings). If you're talking about writing and reading textual data, don't worry. It works.

score 2 · Answer 5 · answered Jan 21 '13 at 00:48

First: The original practical goal of portability is to reduce work; therefore if portability requires more effort than non-portability to achieve the same end result, then writing portable code in such case is no longer advantageous. Do not target 'portability' simply out of principle. In your case, a non-portable version with well-documented notes regarding the disk format is a more efficient means of future-proofing. Trying to write code that somehow caters to any possible generic underlying storage format will probably render your code nearly incomprehensible, or so annoying to maintain that it will fall out of favor for that reason (no need to worry about future-proofing if no one wants to use it anyway 20 yrs from now).

Second: I don't think you have to worry about this, because the only realistic solution to running 8-bit programs on a 9-bit machine (or similar) is via Virtual Machines.

It is extremely likely that anyone in the near or distant future using some 9+ bit machine will be able to start up a legacy x86/ARM virtual machine and run your program that way. Hardware 25-50 years from now should have no problem whatsoever running entire virtual machines just for the sake of executing a single program; and that program will probably still load, execute, and shut down faster than it does today on current native 8-bit hardware. (Some cloud services today, in fact, already trend toward starting entire VMs just to service individual tasks.)

I strongly suspect this is the only means by which any 8-bit program would be run on 9/other-bit machines, due to the points made in other answers regarding the fundamental challenges inherent to simply loading and parsing 8-bit source code or 8-bit binary executables.

It may not be remotely resembling "efficient" but it would work. This also assumes, of course, that the VM will have some mechanism by which 8-bit text files can be imported and exported from the virtual disk onto the host disk.

As you can see, though, this is a huge problem that extends well beyond your source code. The bottom line is that, most likely, it will be much cheaper and easier to update/modify or even re-implement-from-scratch your program on the new hardware, rather than to bother trying to account for such obscure portability issues up-front. The act of accounting for it almost certainly requires more effort than just converting the disk formats.

score 1 · Answer 6 · answered Jan 18 '13 at 12:29

8-bit bytes will remain until end of time, so don't sweat it. There will be new types, but this basic type will never ever change.

Jonas Byström

Only on processors that support 8-bit bytes. There are plenty in common use that only support 16, 32, 64 or 24-bit bytes. – Mike Seymour Jan 18 '13 at 13:57
He's asking for (near) future compatibility, and as hard as it is to predict the future I can state for the record that this one won't change. It's currently [very uncommon](http://stackoverflow.com/questions/5516044/system-where-1-byte-8-bit#5516109), they say and trend is certainly going down. – Jonas Byström Jan 18 '13 at 15:27
@MikeSeymour You could also mention the common processor with non-8-bit bytes, and some estimate on how common they are, exactly... And if they are commonly programmed using C. – hyde Jan 18 '13 at 16:49
@hyde: In my experience, the Motorola 56k and Sharc ADSP series of processors have 24 and 32 bit bytes respectively; there are plenty of similar DSP-style processors in common use, most of which can be programmed with C or C++. – Mike Seymour Jan 19 '13 at 03:52
@JonasByström: If we assume that the question is restricted to PC-style architectures, then it's almost certainly safe to assume that they will always have 8-bit bytes. But C and C++ are widely used on other architectures, so truly portable code can't make that assumption. – Mike Seymour Jan 19 '13 at 03:54
@MikeSeymour are you sure the `char` data type on C-compilers for those processors has more than 8 bits? Just asking, because that would make handling any 8 bit character based external data a pain. – hyde Jan 22 '13 at 11:01
@hyde: Absolutely sure - those processors can't address less than a word of data, and the compilers don't try to fake smaller addressable units. Dealing efficiently with 8-bit characters is indeed a pain, especially on the 24-bit 56k: not only do you have to deal with packing 3 characters in each word, but you also have to deal with a compiler and a standard library with different ideas about how they should be packed. `char const packed hello[] = "leh\0ol";` – Mike Seymour Jan 22 '13 at 11:21
@MikeSeymour: here's another prediction for ya! :) Since even phones use ARM CPU:s these days and prices are dropping as always, there will be no need for odd architectures in the future. – Jonas Byström Jan 22 '13 at 12:10

Gilbert · Answer 7 · 2013-01-26T12:43:13.970

Kind of late but I can't resist this one. Predicting the future is tough. Predicting the future of computers can be more hazardous to your code than premature optimization.

Short Answer
While I end this post with how 9-bit systems handled portability with 8-bit bytes, this experience also makes me believe 9-bit byte systems will never arise again in general purpose computers.

My expectation is that future portability issues will be with hardware having a minimum of 16 or 32 bit access making CHAR_BIT at least 16. Careful design here may help with any unexpected 9-bit bytes.

QUESTION to /. readers: is anyone out there aware of general purpose CPUs in production today using 9-bit bytes or one's complement arithmetic? I can see where embedded controllers may exist, but not much else.

Long Answer
Back in the 1990s the globalization of computers and Unicode made me expect UTF-16, or larger, to drive an expansion of bits-per-character: CHAR_BIT in C. But as legacy outlives everything, I also expect 8-bit bytes to remain an industry standard and to survive at least as long as computers use binary.

BYTE_BIT: bits-per-byte (popular, but not a standard I know of)
BYTE_CHAR: bytes-per-character

The C standard does not address a char consuming multiple bytes. It allows for it, but does not address it.

3.6 byte: (final draft C11 standard ISO/IEC 9899:201x)
addressable unit of data storage large enough to hold any member of the basic character set of the execution environment.
NOTE 1: It is possible to express the address of each individual byte of an object uniquely.

NOTE 2: A byte is composed of a contiguous sequence of bits, the number of which is implementation-defined. The least significant bit is called the low-order bit; the most significant bit is called the high-order bit.

Until the C standard defines how to handle BYTE_CHAR values greater than one, and I'm not talking about “wide characters”, this is the primary factor portable code must address, and not larger bytes. Existing environments where CHAR_BIT is 16 or 32 are what to study. ARM processors are one example. I see two basic modes for reading external byte streams that developers need to choose from:

Unpacked: one BYTE_BIT character into a local character. Beware of sign extensions.
Packed: read BYTE_CHAR bytes into a local character.
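The "beware of sign extensions" warning for the unpacked mode can be made concrete with a small C sketch. The function names octet_to_signed and octet_to_unsigned are invented for this example; the sketch assumes the external data uses 8-bit two's complement values and that one such value arrives in the low bits of a wider local unit.

```c
/* Interprets the low 8 bits of `raw` as an unsigned octet (0..255).
   Any higher bits, for example a ninth bit, are discarded.
   (Hypothetical helper names, for illustration only.) */
int octet_to_unsigned(unsigned int raw) {
    return (int)(raw & 0xFFu);
}

/* Interprets the low 8 bits of `raw` as a two's complement octet and
   returns a value in -128..127, regardless of how wide char or int
   are on the host. */
int octet_to_signed(unsigned int raw) {
    int low = (int)(raw & 0xFFu);   /* 0..255 */
    return (low ^ 0x80) - 0x80;     /* flip the sign bit, then re-centre */
}
```

Choosing between the two explicitly avoids depending on whether plain char is signed on a given compiler, which is one of the ways an unpacked read can silently change meaning between machines.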

Portable programs may need an API layer that addresses the byte issue. To create on the fly an idea I reserve the right to attack in the future:

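  #include <limits.h>                    // CHAR_BIT
  #include <stdio.h>                     // FILE, size_t

  #define BYTE_BIT 8                     // bits per external byte
  #define BYTE_CHAR (CHAR_BIT/BYTE_BIT)  // external bytes per local char

  // Read `size` external bytes from `stream` into `ptr`.
  size_t byread(void  *ptr,
                size_t size,     // number of BYTE_BIT bytes
                int    packing,  // bytes to read per char
                                 // (negative for sign extension)
                FILE  *stream);

  // Write `size` external bytes from `ptr` to `stream`.
  size_t bywrite(const void *ptr,
                 size_t size,
                 int    packing,
                 FILE  *stream);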

size: number of BYTE_BIT bytes to transfer.
packing: bytes to transfer per char. While typically 1 or BYTE_CHAR, it could indicate the BYTE_CHAR of the external system, which can be smaller or larger than that of the current system.
Never forget endianness clashes.

Good Riddance To 9-Bit Systems:
My prior experience with writing programs for 9-bit environments leads me to believe we will not see such again, unless you happen to need a program to run on a really old legacy system somewhere, likely in a 9-bit VM on a 32/64-bit system. Since 2000 I have occasionally searched for, but have not found, references to current descendants of the old 9-bit systems.

Any, highly unexpected in my view, future general purpose 9-bit computers would likely either have an 8-bit mode, or 8-bit VM (@jstine), to run programs under. The only exception would be special purpose built embedded processors, which general purpose code would not be likely to run on anyway.

In days of yore one 9-bit machine was the PDP-15. A decade of wrestling with a clone of this beast makes me never expect to see 9-bit systems arise again. My top picks on why follow:

The extra data bit came from robbing the parity bit in core memory. Old 8-bit core carried a hidden parity bit with it. Every manufacturer did it. Once core got reliable enough, some system designers switched the already existing parity to a data bit in a quick ploy to gain a little more numeric power and memory addresses during times of weak, non-MMU machines. Current memory technology does not have such parity bits, machines are not so weak, and 64-bit memory is so big. All of which should make the design changes less cost effective than the changes were back then.
Transferring data between 8-bit and 9-bit architectures, including off-the-shelf local I/O devices, and not just other systems, was a continuous pain. Different controllers on the same system used incompatible techniques:
1. Use the low order 16-bits of 18 bit words.
2. Use the low-order 8 bits of 9-bit bytes where the extra high-order bit might be set to the parity from bytes read from parity sensitive devices.
3. Combine the low-order 6 bits of three 8-bit bytes to make 18 bit binary words.
Some controllers allowed selecting between 18-bit and 16-bit data transfers at run time. What future hardware, and supporting system calls, your programs would find just can't be predicted in advance.
Connecting to the 8-bit Internet will be horrid enough by itself to kill any 9-bit dreams someone has. They got away with it back then as machines were less interconnected in those times.
Having something other than a power of 2 bits in byte-addressed storage brings up all sorts of troubles. Example: if you want an array of thousands of bits in 8-bit bytes you can write unsigned char bits[1024] = { 0 }; bits[n>>3] |= 1 << (n&7);. To fully pack 9-bit bytes you must do actual divides, which brings horrid performance penalties. This also applies to bytes-per-word.
Any code not actually tested on 9-bit byte hardware may well fail on its first actual venture into the land of unexpected 9-bit bytes, unless the code is so simple that refactoring it in the future for 9 bits is only a minor issue. The prior byread()/bywrite() may help here but it would likely need an additional CHAR_BIT mode setting to set the transfer mode, returning how the current controller arranges the requested bytes.
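The bit-array point in the list above can be sketched in C without hard-coding the 8. The helper names bit_set and bit_test are made up for this illustration; the division and remainder by CHAR_BIT are the "actual divides" mentioned there.

```c
#include <limits.h>
#include <stddef.h>

/* Sets bit `n` in a packed bit array that uses every bit of each
   unsigned char. With CHAR_BIT == 8 a compiler can typically reduce
   the divide and remainder to the shift and mask shown in the text;
   with CHAR_BIT == 9 they remain a real division.
   (Hypothetical helper names, for illustration only.) */
void bit_set(unsigned char *bits, size_t n) {
    bits[n / CHAR_BIT] |= (unsigned char)(1u << (n % CHAR_BIT));
}

/* Returns 1 if bit `n` is set, 0 otherwise. */
int bit_test(const unsigned char *bits, size_t n) {
    return (int)((bits[n / CHAR_BIT] >> (n % CHAR_BIT)) & 1u);
}
```

This version wastes no storage bits on any byte width, at the cost of the division whenever CHAR_BIT is not a power of two.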

To be complete anyone who wants to worry about 9-bit bytes for the educational experience may need to also worry about one's complement systems coming back; something else that seems to have died a well deserved death (two zeros: +0 and -0, is a source of ongoing nightmares... trust me). Back then 9-bit systems often seemed to be paired with one's complement operations.

score 1 · Answer 8 · answered Jan 23 '13 at 18:46

I think the likelihood of non-8-bit bytes in future computers is low. It would require rewriting so much, and for so little benefit. But if it happens...

You'll save yourself a lot of trouble by doing all calculations in native data types and just rewriting inputs. I'm picturing something like:

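template<int OUTPUTBITS, typename CALLABLE>
class converter {
public:
  // datasource presumably yields inputbits bits per call
  converter(int inputbits, CALLABLE datasource);
  // smallestTypeWithAtLeast is a hypothetical type selector, not a standard facility
  smallestTypeWithAtLeast<OUTPUTBITS> get();
};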

Note that this can be written in the future when such a machine exists, so you need do nothing now. Or if you're really paranoid, make sure get just calls datasource when OUTPUTBITS==inputbits.
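For readers who want to see what such a converter might do internally, here is a minimal sketch in plain C rather than as the C++ template above. The function name repack_bits, the most-significant-bit-first ordering, and the zero padding of a final partial unit are all assumptions made for this example; a real format would have to specify them.

```c
#include <stddef.h>

/* Re-chunks `count` input units of `in_bits` bits each into output
   units of `out_bits` bits each, most significant bit first. A final
   partial unit is padded with zero bits on the right. Returns the
   number of output units written; `out` must have room for
   ceil(count * in_bits / out_bits) units. Both widths are assumed to
   be between 1 and 16. (Hypothetical helper, for illustration only.) */
size_t repack_bits(const unsigned int *in, size_t count, unsigned in_bits,
                   unsigned int *out, unsigned out_bits) {
    unsigned long acc = 0;      /* bit accumulator */
    unsigned have = 0;          /* number of valid bits held in acc */
    size_t written = 0;
    size_t i;
    unsigned long in_mask = (1ul << in_bits) - 1;
    unsigned long out_mask = (1ul << out_bits) - 1;

    for (i = 0; i < count; i++) {
        acc = (acc << in_bits) | (in[i] & in_mask);
        have += in_bits;
        while (have >= out_bits) {
            have -= out_bits;
            out[written++] = (unsigned int)((acc >> have) & out_mask);
        }
        acc &= (1ul << have) - 1;   /* drop the bits already emitted */
    }
    if (have > 0) {
        out[written++] = (unsigned int)((acc << (out_bits - have)) & out_mask);
    }
    return written;
}
```

When in_bits equals out_bits the function simply copies each unit through, which matches the "just calls datasource" case described above.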

score -1 · Answer 9 · answered Jan 24 '13 at 04:12

In a programming language, a byte is always 8 bits. So, if a byte representation has 9 bits on some machine, for whatever reason, it's up to the C compiler to reconcile that. As long as you write text using char, say, if you write/read 'A' to a file, you would be writing/reading only 8 bits to the file. So, you should not have any problem.
