Writing numerical data to file as binary vs. written out?

Question

I'm writing floating point numbers to file, but there are two different ways of writing these numbers and I'm wondering which to use.

The two choices are:

1. write the raw representative bits to file
2. write the ASCII representation of the number to file

Option 1 seems like it would be more practical to me, since I'm truncating each float to 4 bytes. And parsing each number can be skipped entirely when reading. But in practice, I've only ever seen option 2 used.

The data in question is 3D model information, where small file sizes and quick reading can be very advantageous, but again, no existing 3D model format does this that I know of, and I imagine there must be a good reason behind it.

My question is, what reasons are there for writing numbers out as text instead of writing their bit representation? And are there situations where the binary form would be preferred?

Not sure if it's the primary reason, but one reason is that you don't have to worry about different [endianness](https://en.wikipedia.org/wiki/Endianness) between machines. — Benjamin Lindley, Jul 11 '15 at 22:55
Write as binary for efficient use by computers and as text for efficient use by humans (e.g., debugging). — James Adkison, Jul 11 '15 at 22:56
`as a floating point number will always consume a fixed number of bytes, and only 4 at that.` Wrong. Even worse, the size is only one of many problems. — deviantfan, Jul 11 '15 at 22:57
@deviantfan - sorry, I shouldn't have written that as generally. In my case though, it will always be written as 4 bytes, truncating any extra precision — Anne Quinn, Jul 11 '15 at 23:03

yzt · Accepted Answer · 2015-07-12T00:16:36.947

First of all, floats are 4 bytes on any architecture you might encounter normally, so nothing is "truncated" when you write the 4 bytes from memory to a file.

As for your main question, many regular file formats are designed for "interoperability" and ease of reading/writing. That's why text, which is an almost universally portable representation (character encoding issues notwithstanding), is used most often.

For example, it is very easy for a program to read the string "123" from a text file and know that it represents the number 123.

(But note that text itself is not a format. You might choose to represent all your data elements as ASCII/Unicode/whatever strings of characters, and put all these strings together to form a text file, but you still need to specify exactly what each element means and what data can be found where. For example, a very simplistic text-based 3D triangle mesh file format might have the number of triangles N on the first line of the file, followed by N lines that each hold three triplets of real numbers: the 9 numbers required for the X, Y, Z coordinates of the three vertices of a triangle.)
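As a rough illustration of that simplistic mesh format, here is a minimal sketch of a reader in C. The `Triangle` type and the `parse_mesh_text` function are hypothetical names made up for this example, and treating an empty mesh as an error is an arbitrary choice.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical type: one triangle = three vertices with X, Y, Z each. */
typedef struct { float v[3][3]; } Triangle;

/* Parses "N" followed by nine numbers per triangle from a string.
   Returns a malloc'ed array (caller frees) and stores N in *count,
   or returns NULL if the text is malformed or N is zero. */
Triangle *parse_mesh_text(const char *text, size_t *count)
{
    assert(text != NULL && count != NULL);
    unsigned long n = 0;
    int used = 0;

    /* %n records how many characters were consumed so far */
    if (sscanf(text, "%lu%n", &n, &used) != 1 || n == 0) return NULL;
    text += used;

    Triangle *tris = malloc(n * sizeof *tris);
    if (tris == NULL) return NULL;

    for (unsigned long i = 0; i < n; i++) {
        float (*v)[3] = tris[i].v;
        if (sscanf(text, "%f%f%f%f%f%f%f%f%f%n",
                   &v[0][0], &v[0][1], &v[0][2],
                   &v[1][0], &v[1][1], &v[1][2],
                   &v[2][0], &v[2][1], &v[2][2], &used) != 9) {
            free(tris);   /* fewer than nine numbers: reject the file */
            return NULL;
        }
        text += used;
    }
    *count = (size_t)n;
    return tris;
}
```

Even this tiny format needs explicit parsing and error handling for every number, which is the cost that a binary layout avoids.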

Binary formats, on the other hand, usually store the data elements in the same form they take in computer memory. This means an integer is represented with a fixed number of bytes (1, 2, 4 or 8, usually in "two's complement" format) and a real number is represented by 4 or 8 bytes in IEEE 754 format. (Note that I'm omitting a lot of details for the sake of staying on point.)
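To make the in-memory form concrete, here is a small sketch that copies the bytes of a value into an unsigned integer so they can be inspected. The helper names `float_bits` and `int32_bits` are invented for this example, and the bit patterns mentioned afterwards assume the platform uses IEEE 754 single precision for `float`.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Returns the 32 bits that make up a float, without converting the value. */
uint32_t float_bits(float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);   /* holds wherever float is 4 bytes */
    memcpy(&bits, &f, sizeof bits);    /* copy the raw bytes as they are */
    return bits;
}

/* Same idea for a signed 32-bit integer (two's complement). */
uint32_t int32_bits(int32_t i)
{
    uint32_t bits;
    memcpy(&bits, &i, sizeof bits);
    return bits;
}
```

On an IEEE 754 platform, `float_bits(1.0f)` is `0x3F800000` and `int32_bits(-1)` is `0xFFFFFFFF`. These four bytes are what a binary format would write to the file.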

Main advantages of a binary format are:

1. They are usually smaller in size. A 32-bit integer written as an ASCII string can take up to 10 or 11 bytes (e.g. -1000000000) but in binary it always takes up 4 bytes. And smaller means faster to transfer (over network, from disk to memory, etc.) and easier to store.
2. Each data element is faster to read. No complicated parsing is required. If the data element happens to be in the exact format/layout that your platform/language can work with, then you just need to transfer the few bytes from disk to memory and you are done.
3. Even large and complex data structures can be laid out on disk in exactly the same way as they would have been in memory, and then all you need to do to "read" that format is to get that large blob of bytes (which probably contains many, many data elements) from disk into memory, in one easy and fast operation, and you are done.

But that third advantage requires the layout of the data on disk to match the layout of your data structures in memory exactly, bit for bit. As a result, the file format will almost always work only with your code, and it may stop working even there if you change your data structures. In other words, it is not at all portable or interoperable. But it is damned fast to work with!
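The "one blob" approach could look roughly like the sketch below. The `Vertex` struct and the two function names are hypothetical, and the file written this way is only meaningful to code that uses the same struct layout, compiler settings and hardware.

```c
#include <assert.h>
#include <stdio.h>

/* Hypothetical in-memory layout; the file will mirror it byte for byte. */
typedef struct { float x, y, z; } Vertex;

/* Writes the array exactly as it sits in memory. Returns 1 on success. */
int save_vertices(FILE *f, const Vertex *v, size_t n)
{
    assert(f != NULL && v != NULL);
    return fwrite(v, sizeof *v, n, f) == n;
}

/* Reads n vertices back in a single call, with no per-number parsing. */
int load_vertices(FILE *f, Vertex *v, size_t n)
{
    assert(f != NULL && v != NULL);
    return fread(v, sizeof *v, n, f) == n;
}
```

Adding a field to `Vertex`, or reading the file on a machine with a different byte order, silently breaks every file written earlier. That is the portability cost described above.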

There are disadvantages to binary formats too:

1. You cannot view, edit or make sense of them in simple, generic software like a text editor anymore. You can open any XML, JSON or config file in any text editor and make some sense of it quite easily, but not a JPEG file.
2. You will usually need more specific code to read or write a binary format than a text format, not to mention a specification that documents what every bit of the file should be. Text files are generally more self-explanatory and obvious.
3. In some (many) languages (scripting and "higher-level" languages) you usually don't have access to the bytes that make up an integer or a float, neither to read them nor to write them. This means that you'll lose most of the speed advantages that binary files give you when you are working in a lower-level language like C or C++.
4. Binary in-memory formats of primitive data types are almost always tied to the hardware (or more generally, the whole platform) that the memory is attached to. When you choose to write the same bits from memory to a file, the file format becomes hardware-dependent as well. One platform might not store floating-point real numbers exactly the same way as another, which means binary files written on one cannot be read naively on the other (care must be taken and the data carefully converted into the target format). One major difference between hardware architectures is known as "endianness", which affects how multibyte primitives (e.g. a 4-byte integer, or an 8-byte float) are expected to be stored in memory (from highest-order byte to the lowest-order, or vice versa, which are called "big endian" and "little endian" respectively). Data written to a binary file on a big-endian architecture (e.g. PowerPC) and read verbatim on a little-endian architecture (e.g. x86) will have the bytes of each primitive in reverse order, which means all (well, almost all) the values will be wrong.
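One common way around the endianness problem is for the file format to fix a byte order and for every program to assemble the bytes explicitly. The sketch below picks little-endian arbitrarily. The function names are made up, and it assumes that writer and reader both use 4-byte IEEE 754 floats and that floats share the byte order of integers on each machine.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Stores the float's 32 bits in little-endian order, whatever the host uses. */
void put_f32_le(unsigned char out[4], float f)
{
    uint32_t bits;
    assert(sizeof bits == sizeof f);
    memcpy(&bits, &f, sizeof bits);
    out[0] = (unsigned char)(bits & 0xFF);          /* lowest-order byte first */
    out[1] = (unsigned char)((bits >> 8) & 0xFF);
    out[2] = (unsigned char)((bits >> 16) & 0xFF);
    out[3] = (unsigned char)((bits >> 24) & 0xFF);  /* highest-order byte last */
}

/* Rebuilds the float from four little-endian bytes on any host. */
float get_f32_le(const unsigned char in[4])
{
    uint32_t bits = (uint32_t)in[0]
                  | ((uint32_t)in[1] << 8)
                  | ((uint32_t)in[2] << 16)
                  | ((uint32_t)in[3] << 24);
    float f;
    memcpy(&f, &bits, sizeof f);
    return f;
}
```

Because the shifts operate on values rather than on memory, the same source code produces the same file bytes on big-endian and little-endian machines.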

Since you mention 3D model data, let me give you an example of what formats are used in a typical game engine. The game engine runtime will most likely need the most speed it can have in reading the models, and 3D models are large, so usually it has a very specific, and not-at-all-portable format for its model files. But that format would most likely not be supported by any modeling software. So you need to write a converter (also called an exporter or importer) that would take a common, generally-used format (e.g. OBJ, DAE, etc.) and convert that into the engine-specific, proprietary format. But as I mentioned, reading/transferring/working-with a text-based format is easier than a binary format, so you usually would choose a text-based common format to export your models into, then run the converter on them to the optimized, binary, engine-specific runtime format.

What about endianness? Take a case where I write a 32-bit value to a file on a Big Endian platform and read it back on a Little Endian platform. Will the number be the same on the Little Endian platform? — Thomas Matthews, Jul 11 '15 at 23:55
@ThomasMatthews: It won't be the same, unless the number happens to be a byte-wise palindrome (e.g. 0x12343412.) — yzt, Jul 12 '15 at 00:17
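The palindrome remark can be checked with a short sketch. `swap32` is a hypothetical helper that reverses the four bytes of a value, which is what reading a 32-bit integer verbatim on a machine of the opposite endianness amounts to.

```c
#include <assert.h>
#include <stdint.h>

/* Reverses the byte order of a 32-bit value. */
uint32_t swap32(uint32_t v)
{
    return (v >> 24)
         | ((v >> 8) & 0x0000FF00u)
         | ((v << 8) & 0x00FF0000u)
         | (v << 24);
}

/* The two cases from the comments, written as checks. */
void swap32_examples(void)
{
    assert(swap32(0x12345678u) == 0x78563412u);  /* ordinary value changes */
    assert(swap32(0x12343412u) == 0x12343412u);  /* palindrome survives */
}
```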

score 3 · Answer 2 · edited May 24 '23 at 08:22

You might prefer binary format if:

- You want a more compact encoding (fewer bytes, because text encoding will probably take more space).
- You care about precision: encoding as text can lose precision, although there may be ways to encode as text without losing any*.
- You need performance, which is probably another advantage of binary encoding.

Since you mention that the data in question is 3D model information, compactness of encoding (maybe also performance) and precision may be relevant for you. On the other hand, text encoding is human readable.

That said, with binary encoding you typically have issues like endianness, and the float representation may be different on different machines, but here is a way to encode floats (or doubles) in binary format in a portable way:

#include <stdint.h> // for uint64_t

uint64_t pack754(long double f, unsigned bits, unsigned expbits)
{
    long double fnorm;
    int shift;
    long long sign, exp, significand;
    unsigned significandbits = bits - expbits - 1; // -1 for sign bit

    if (f == 0.0) return 0; // get this special case out of the way

    // check sign and begin normalization
    if (f < 0) { sign = 1; fnorm = -f; }
    else { sign = 0; fnorm = f; }

    // get the normalized form of f and track the exponent
    shift = 0;
    while(fnorm >= 2.0) { fnorm /= 2.0; shift++; }
    while(fnorm < 1.0) { fnorm *= 2.0; shift--; }
    fnorm = fnorm - 1.0;

    // calculate the binary form (non-float) of the significand data
    significand = fnorm * ((1LL<<significandbits) + 0.5f);

    // get the biased exponent
    exp = shift + ((1<<(expbits-1)) - 1); // shift + bias

    // return the final answer
    return (sign<<(bits-1)) | (exp<<(bits-expbits-1)) | significand;
}
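A reader also needs the opposite direction. The following is a sketch of what a matching decoder might look like. It is not part of the answer above, the name `unpack754` is an assumption, and like `pack754` it ignores denormals, infinities and NaN.

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical inverse of pack754: rebuilds the value from its bit fields. */
long double unpack754(uint64_t i, unsigned bits, unsigned expbits)
{
    assert(bits <= 64 && expbits >= 1 && expbits + 1 < bits);
    unsigned significandbits = bits - expbits - 1; /* -1 for the sign bit */

    if (i == 0) return 0.0L; /* mirrors the special case in pack754 */

    /* fraction field -> value in [1, 2) */
    long double result = (long double)(i & ((1ULL << significandbits) - 1));
    result /= (long double)(1ULL << significandbits);
    result += 1.0L;

    /* remove the bias, then apply the power of two */
    long long bias = (1LL << (expbits - 1)) - 1;
    long long shift =
        (long long)((i >> significandbits) & ((1ULL << expbits) - 1)) - bias;
    while (shift > 0) { result *= 2.0L; shift--; }
    while (shift < 0) { result /= 2.0L; shift++; }

    /* the top bit is the sign */
    return ((i >> (bits - 1)) & 1) ? -result : result;
}
```

For a single-precision float the call would be `unpack754(value, 32, 8)`, and for a double `unpack754(value, 64, 11)`. The integer itself still has to be written in an agreed byte order.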

*: Since C99 there seems to be a way to do this in C, but I still think it will take more space.
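The footnote does not say which C99 feature is meant. One candidate is the hexadecimal floating-point conversion `%a`, which prints the exact binary value. Printing 9 significant decimal digits is also enough to round-trip an IEEE 754 single. The sketch below shows both, and the function names are made up for the example.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Formats f so that parsing the text gives back the identical float.
   hex != 0 uses C99 hexadecimal notation, otherwise 9 decimal digits.
   Returns the length of the text, as reported by snprintf. */
int float_to_text(char *buf, size_t size, float f, int hex)
{
    assert(buf != NULL && size > 0);
    if (hex)
        return snprintf(buf, size, "%a", f);   /* exact, base 16 */
    return snprintf(buf, size, "%.9g", f);     /* 9 significant digits */
}

/* Parses either notation; strtof accepts hexadecimal floats since C99. */
float text_to_float(const char *s)
{
    return strtof(s, NULL);
}
```

Both notations preserve the value, but the text for a number like `0.1f` is longer than the 4 bytes of the binary form, which matches the point about space.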
