How to handle portability issues in a binary file format

Question

I'm designing a binary file format to store strings (without a terminating null, to save space) and binary data.

i. What is the best way to deal with little/big endian systems? i.a Would converting everything to network byte order and back with ntohl()/htonl() work?

ii. Will the packed structures be the same size on x86, x64 and arm?

iii. Are there any inherent weaknesses with this approach?

struct __attribute__((packed)) Header {
    uint8_t magic;
    uint8_t flags;
};

struct __attribute__((packed)) Record {
    uint64_t length;
    uint32_t crc;
    uint16_t year;
    uint8_t day;
    uint8_t month;
    uint8_t hour;
    uint8_t minute;
    uint8_t second;
    uint8_t type;
};

Tester code I'm using to develop the format:

#include <stdlib.h>
#include <unistd.h>
#include <stdio.h>
#include <limits.h>
#include <strings.h>
#include <stdint.h>
#include <sys/stat.h>
#include <fcntl.h>
#include <string.h>

struct __attribute__((packed)) Header {
    uint8_t magic;
    uint8_t flags;
};

struct __attribute__((packed)) Record {
    uint64_t length;
    uint32_t crc;
    uint16_t year;
    uint8_t day;
    uint8_t month;
    uint8_t hour;
    uint8_t minute;
    uint8_t second;
    uint8_t type;
};

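int main(void)
{
    /* The permission bits must be octal: 0644, not decimal 444. */
    int fd = open("test.dat", O_RDWR|O_APPEND|O_CREAT, 0644);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    struct Header header = {1, 0};
    write(fd, &header, sizeof(header));
    char msg[] = "BINARY";
    /* Fields without an initializer are set to zero. */
    struct Record record = {strlen(msg), 0, 0, 0, 0, 0, 0, 0};
    write(fd, &record, sizeof(record));
    write(fd, msg, record.length);
    close(fd);

    /* Reopen for reading; reads start at the beginning of the file. */
    fd = open("test.dat", O_RDONLY);
    if (fd == -1) {
        perror("open");
        return EXIT_FAILURE;
    }
    read(fd, &header, sizeof(struct Header));
    read(fd, &record, sizeof(struct Record));
    uint64_t len = record.length;   /* an int could truncate a 64-bit length */
    char c;
    /* Stop early if the file is shorter than the recorded length. */
    while (len != 0 && read(fd, &c, 1) == 1) {
        len--;
        printf("%c", c);
    }
    close(fd);
    return 0;
}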

I'm voting to close, as too broad for SO - sorry, try a different site! — too honest for this site, Jun 24 '15 at 20:29
@Olaf: I would vote to keep: this is a very practical real-world sort of question that comes up all the time. Just because it doesn't have a single cut-and-dried answer doesn't mean it doesn't deserve consideration. (With that said, though, I am not a SO regular, so if the consensus is that there are certain swaths of practical, real-world programming questions that this site is _not_ for, I'm in no position to argue.) — Steve Summit, Jun 24 '15 at 20:37
@SteveSummit: I do agree actually in that the question is in fact interesting (mind my "sorry"). However, it is off-topic for SO. I really hope the OP finds another site (not sure, if there is one on stack exchange). For the vote: well, that is clearly _my_ opinion. If others think different, it _will_ stay open. I can live with that. — too honest for this site, Jun 24 '15 at 20:40
"without terminating null to save space"... Really? A terminating NULL byte is a single byte per string. Your alternative is to store a length, which could be a single byte, but that would limit strings to 256 bytes or less. More usefully, the length would be multiple bytes, which actually uses more space than a terminating NULL... — twalberg, Jun 24 '15 at 21:16
FYI, http://stackoverflow.com/questions/1577161/passing-a-structure-through-sockets-in-c — Giorgi Moniava, Jun 24 '15 at 21:34

Steve Summit · Accepted Answer · 2015-06-25T14:24:45.683

7

i. Defining the file to be in one order and converting to and from "internal" order, if necessary, when reading/writing (perhaps with ntohl and the like) is, in my opinion, the best approach.

ii. I do not trust packed structures. They might work for this approach for those platforms, but there are no guarantees.

iii. Reading and writing binary files using fread and fwrite on whole structs is (again in my opinion) an inherently weak approach. You maximize the likelihood that you will be bitten by word size problems, padding and alignment problems, and byte order problems.
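To make point i concrete, here is a minimal sketch of fixing one 32-bit field (such as crc) to network byte order with htonl()/ntohl(). The helper names store_u32_be and load_u32_be are made up for this illustration, and the sketch assumes a POSIX system that provides <arpa/inet.h>. Note that htonl()/ntohl() handle only 32-bit values (htons()/ntohs() handle 16-bit ones), so the 64-bit length field would need separate handling.

```c
#include <arpa/inet.h>  /* htonl, ntohl (POSIX) */
#include <stdint.h>
#include <string.h>

/* Hypothetical helper: store a 32-bit value into buf in big-endian
   (network) order, regardless of the host's byte order. */
static void store_u32_be(unsigned char *buf, uint32_t value)
{
    uint32_t be = htonl(value);
    memcpy(buf, &be, sizeof be);  /* memcpy sidesteps alignment problems */
}

/* Hypothetical helper: load a 32-bit big-endian value from buf and
   return it in host order. */
static uint32_t load_u32_be(const unsigned char *buf)
{
    uint32_t be;
    memcpy(&be, buf, sizeof be);
    return ntohl(be);
}
```

The buffer filled this way can be written to the file as plain bytes, so no struct layout is involved.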

What I like to do is write little functions like get16() and put32() that read and write a byte at a time and so are inherently insensitive to word size and byte order difficulties. Then I write straightforward putHeader and getRecord functions (and the like) in terms of these.

unsigned int get16(FILE *fp)
{
    unsigned int r;
    r = getc(fp);
    r = (r << 8) | getc(fp);
    return r;
}

void put32(unsigned long int x, FILE *fp)
{
    putc((int)((x >> 24) & 0xff), fp);
    putc((int)((x >> 16) & 0xff), fp);
    putc((int)((x >> 8) & 0xff), fp);
    putc((int)(x & 0xff), fp);
}

[P.S. As @Olaf correctly points out in one of the comments, in production code you'd need handling for EOF and error in these functions. I've left those out for simplicity of presentation.]
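As a rough sketch of how the record functions might be built on top of such helpers, the following writes and reads the question's Record one field at a time in big-endian order. The generic helpers put_be and get_be, the field order on disk, and the return conventions are assumptions made for this illustration, not part of the original answer. Errors are checked once at the end with feof()/ferror() rather than after every byte.

```c
#include <stdint.h>
#include <stdio.h>

/* In-memory form only. The on-disk layout is defined by putRecord and
   getRecord below, not by how the compiler lays out this struct. */
struct Record {
    uint64_t length;
    uint32_t crc;
    uint16_t year;
    uint8_t day, month, hour, minute, second, type;
};

/* Hypothetical helper: write the low nbytes of x, most significant first. */
static void put_be(uint64_t x, int nbytes, FILE *fp)
{
    for (int shift = 8 * (nbytes - 1); shift >= 0; shift -= 8)
        putc((int)((x >> shift) & 0xff), fp);
}

/* Hypothetical helper: read nbytes, most significant first. EOF is not
   detected here; the caller checks feof()/ferror() afterwards. */
static uint64_t get_be(int nbytes, FILE *fp)
{
    uint64_t r = 0;
    while (nbytes-- > 0)
        r = (r << 8) | (uint64_t)(getc(fp) & 0xff);
    return r;
}

/* Writes 20 bytes. Returns 0 on success, -1 on a write error. */
static int putRecord(const struct Record *r, FILE *fp)
{
    put_be(r->length, 8, fp);
    put_be(r->crc, 4, fp);
    put_be(r->year, 2, fp);
    put_be(r->day, 1, fp);
    put_be(r->month, 1, fp);
    put_be(r->hour, 1, fp);
    put_be(r->minute, 1, fp);
    put_be(r->second, 1, fp);
    put_be(r->type, 1, fp);
    return ferror(fp) ? -1 : 0;
}

/* Reads 20 bytes. Returns 0 on success, -1 on EOF or a read error. */
static int getRecord(struct Record *r, FILE *fp)
{
    r->length = get_be(8, fp);
    r->crc    = (uint32_t)get_be(4, fp);
    r->year   = (uint16_t)get_be(2, fp);
    r->day    = (uint8_t)get_be(1, fp);
    r->month  = (uint8_t)get_be(1, fp);
    r->hour   = (uint8_t)get_be(1, fp);
    r->minute = (uint8_t)get_be(1, fp);
    r->second = (uint8_t)get_be(1, fp);
    r->type   = (uint8_t)get_be(1, fp);
    return (feof(fp) || ferror(fp)) ? -1 : 0;
}
```

Because every field is serialized explicitly, the struct needs no packed attribute, and the file comes out the same whatever the host's byte order or padding rules.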

edited Jun 25 '15 at 14:24

answered Jun 24 '15 at 20:28


Would you implement get16() with pointer arithmetic on an mmap'ed buffer? – clockley1 Jun 24 '15 at 20:32
See? You actually _did_ vote ;-) Note @user1450181: you definitely should add error handling, catching `EOF`/error for both directions! Also note for both: `getc()` returns `int`. Casting to `unsigned` is _implementation defined_ for negative values. – too honest for this site Jun 24 '15 at 20:50
@Olaf: yes, there are definitely plenty of subtleties here when it comes to signed vs. unsigned variables. (Thanks, too, for the reminder about EOF and error handling.) But although `getc` is indeed declared as returning signed ints, the only negative value it will ever return is `EOF`; all of its normal character returns are guaranteed to be positive. – Steve Summit Jun 25 '15 at 13:15
@SteveSummit: That's why I would first use `int`, check for `EOF`, then cast to `unsigned char` and upcast to another unsigned type as required. Not very elegant, but that should be safe. (It will likely not create more program code, however.) (I'm glad not to be bound to the stdlib functions.) – too honest for this site Jun 25 '15 at 13:46
@Olaf: Actually, this is one of the few places where I will sometimes choose _not_ to do exhaustive, call-by-call error checking. Just call `getc` 2 or 4 or 8 or however many times, then check `feof` and `ferror` at the very end. – Steve Summit Jun 25 '15 at 14:13
Ok, agree. I was not sure if `getc()` is allowed to be called after EOF was already reached (it is, according to the standard). – too honest for this site Jun 25 '15 at 14:18
It is important to note, that `feof` does not check the stream, but only the `EOF` flag of the stream. That's why it actually works here; otherwise you could not be sure if the last read value is valid or not. – too honest for this site Jun 25 '15 at 14:23
The `putc()/getc()` has many advantages concerning portability. +1 for showing that. Note that `EOF` may be returned on an error - a rare event - I deal with serial communication a lot - where it is not so rare. The next `getc()` might not be an `EOF`. `ferror()/feof()` helps there. certainly this is beyond OP's needs. – chux - Reinstate Monica Jun 25 '15 at 16:54
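For comparison, the per-call checking suggested in the comments above (keep the result as `int`, test for `EOF`, then convert through `unsigned char`) might look like the sketch below. The name get16_checked and its out-parameter interface are assumptions made for this illustration.

```c
#include <stdio.h>

/* Hypothetical variant of get16(): reads a big-endian 16-bit value into
   *out. Returns 0 on success, -1 if EOF or an error occurs on either byte. */
static int get16_checked(FILE *fp, unsigned int *out)
{
    int hi = getc(fp);          /* keep as int so EOF stays distinguishable */
    if (hi == EOF)
        return -1;
    int lo = getc(fp);
    if (lo == EOF)
        return -1;
    *out = ((unsigned int)(unsigned char)hi << 8) | (unsigned char)lo;
    return 0;
}
```

The trade-off is more branching per byte in exchange for knowing exactly which read failed; checking `feof` and `ferror` once at the end is shorter but only reports that something went wrong.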
