
There are a lot of file formats that require writing a set of integers to a file for various reasons, and I've recently found myself doing this:

uint32_t u32;

u32 = ...
fwrite(&u32, 4, 1, file);

u32 = ...
fwrite(&u32, 4, 1, file);

...

which seems unnecessary and tedious. Is this a good practice? Or is there a better solution?

GuillemVS
  • You can put them in an array and write the whole array. – Barmar Feb 08 '23 at 21:53
  • _"unnecessary"_ - how? – Ted Lyngmo Feb 08 '23 at 21:53
  • Use `sizeof u32` to make maintenance easier. No hard coded magic numbers. Sorry about the extra typing. – Weather Vane Feb 08 '23 at 21:54
  • If writing two statements for each write is annoying, put it in a function. – Barmar Feb 08 '23 at 21:54
  • @TedLyngmo well, I don't know exactly how this translates into assembly, but from a high level perspective it looks like a lot of repetition – GuillemVS Feb 08 '23 at 21:54
  • This depends entirely on the requirements of your file format. Do you care about endianness? Handling systems where `uint32` is different sizes? – Stephen Newell Feb 08 '23 at 21:56
  • Yes, well, `size_t write_u32s(const uint32_t *data, size_t elems, FILE *file) { return fwrite(data, sizeof *data, elems, file); }` could help a little. Just put them in an array as @Barmar suggests and call that function. – Ted Lyngmo Feb 08 '23 at 21:56
  • I sometimes use a macro only defined in a given function that hard-codes most local variables (i.e. the opposite of functional programming). Another option would be to implement an `fprintf()`-like function to write binary data. – Allan Wind Feb 08 '23 at 22:04
  • @WeatherVane How does it make maintenance easier? The only thing that I can think of is that the variable may change size, and now `u32` is a particularly bad choice of name for a variable. I was never keen on Hungarian notation for the same reason. – Allan Wind Feb 08 '23 at 22:10
  • @Allan Wind, because it will be tied to the definition, and one less thing to consider. We do that with `malloc` calls too. Yes, the choice of name is as bad as `var` but that doesn't defeat my argument. – Weather Vane Feb 08 '23 at 22:13
  • Sidenote: writing a whole array results in fewer syscalls, thus is a little more efficient... – Aconcagua Feb 08 '23 at 22:14
  • @StephenNewell The point of `uint32_t` is the fixed size. What am I missing? – Allan Wind Feb 08 '23 at 22:16
  • @StephenNewell how can `uint32_t` have another size? There is a significant difference between `unsigned int` and `uint32_t` – 0___________ Feb 08 '23 at 22:17
  • To be portable save them as text :) – 0___________ Feb 08 '23 at 22:18
  • `uint32_t` is not necessarily 4 bytes (a byte may be larger than 8 bits); the second argument should be `sizeof u32` – M.M Feb 08 '23 at 22:22
  • Unless I'm mistaken, there are (admittedly rare) systems where a byte is more than 8 bits. – Stephen Newell Feb 08 '23 at 22:22
  • I doubt you would measure any difference between multiple `fwrite` calls and writing to an in-memory array first; the file subsystem buffers writes. But it might be interesting to try anyway. – M.M Feb 08 '23 at 22:24
  • As @StephenNewell mentions, if a byte is 16 or 32 bits, then `fwrite(..., 4, ...` is going to blow up. At least my `man fwrite` page talks about bytes and not octets. – Ted Lyngmo Feb 08 '23 at 22:33
  • @AllanWind You're right, and most of the other commentators here are, IMO, wrong. If the file format is fixed, but the size of a variable changes, you do *not* want your code to start writing files in a different format. So the hardcoded "magic number" 4 is preferable, and `sizeof(u32)`, can be quite dangerous. Where I work we had some seriously production-impacting problems with some analogous code where the variable was of type `long`, which changed size under x86_64, meaning that some programs were writing data files that other programs couldn't read. (Admittedly type `long` was wrong.) – Steve Summit Feb 08 '23 at 23:06
  • Is this file supposed to be read always on the same architecture? If not, besides size you should also care about endianness. – dimich Feb 08 '23 at 23:19
  • @SteveSummit because of that mistake you mention, do you always hard code, for example, `int *myptr = malloc(4)` or `mytype *myptr = malloc(4)`? – Weather Vane Feb 08 '23 at 23:34
  • *The typedef name `uintN_t` designates an unsigned integer type with width N and no padding bits. Thus, `uint24_t` denotes such an unsigned integer type with a width of exactly 24 bits.* – 0___________ Feb 08 '23 at 23:34
  • @WeatherVane no one is talking about magic numbers and hardcoded values. But some people insist that `uint32_t` can have other than 32 bits – 0___________ Feb 08 '23 at 23:35
  • @0___________ so if the type is changed from `uint32_t` to `uint16_t` you would welcome hunting through the program and figuring out which `4`s should be `2` and which relate to something else that is `4` and must not be changed? – Weather Vane Feb 08 '23 at 23:42
  • er... in the last comment but one. It was you who decided to take issue with it. – Weather Vane Feb 08 '23 at 23:45
  • @WeatherVane Of course not. malloc'ing memory to hold an object, versus writing an object's value to a data file, are totally different problems. – Steve Summit Feb 08 '23 at 23:49
  • @SteveSummit I disagree with your earlier objection, which seems to have a quite different cause - the oversight of what `long` is and its ramifications still rankle you. The same applies with `fwrite` - use the size of the type. – Weather Vane Feb 08 '23 at 23:52
  • @WeatherVane *so if the type is changed... you would welcome hunting...* My point is that if the file format is constant, you do not (should not, must not) do any hunting, because those sizes are utterly fixed and must never change. That's why hardcoding them is acceptable, and IMO preferable. (But this is, I know, heretical advice, and I don't expect to convince anyone who's utterly fixed that hardcoded size values are always wrong.) – Steve Summit Feb 08 '23 at 23:52
  • @SteveSummit if the data size was important, the mistake was to use `long`. The rest only followed. – Weather Vane Feb 08 '23 at 23:55
  • @WeatherVane Yes, using `long` was wrong. And, as dimich observed, byte order is always an issue. But if the file format must not change, then you can either (a) hardcode the size in your write statements as I am heretically advocating, or (b) use `sizeof(var)`, but immediately adjacent have an `assert(sizeof(var) == 4);`. (Actually, for maximum protection, you may need an assertion in either case.) – Steve Summit Feb 08 '23 at 23:55
  • I believe that using `sizeof` is wrong, antipreferable even, because you're saying, "if the size of this variable changes, I want the file format to silently change". But I do *not* want the file format to silently change. Ever. – Steve Summit Feb 08 '23 at 23:57
  • @SteveSummit IMO a better solution would be to use `int32_t` instead of `long` and annotate the variable definition to say "do not change this size". (and an assertion). It's still no excuse for hard-coding sizes, because who knows what the future might bring? (and an update of the database). – Weather Vane Feb 08 '23 at 23:58
  • @WeatherVane We won't agree on this, but despite everything you've said, IMO this absolutely *is* an excuse for hard-coding sizes. The code that writes the file is the code that defines the file format. If that format ever changes, it's easiest if all the explicit sizes are right there, too. – Steve Summit Feb 09 '23 at 00:03
  • @SteveSummit surely if you are so fixed on it, you could have `int32_t datavar; size_t datasize = 4;`? No magic numbers to hunt down. – Weather Vane Feb 09 '23 at 00:05
  • @SteveSummit *because those sizes are utterly fixed and must never change* Yeah, I want a pony, too. ;-) If you use `sizeof()` and the type never changes, you never have to change anything anyway. And if the type *does* change, you won't have to worry "Did I miss one somewhere??" and hope the change doesn't result in some catastrophic bug, all because you didn't want to spend a few seconds more typing `sizeof(var)` instead of `4` for each of the dozen or so read/write operations? *The code that writes the file is the code that defines the file format.* No, the 900-page spec that cost $5,000,000 is. :-( – Andrew Henle Feb 09 '23 at 00:34
  • @SteveSummit FWIW, I appreciate that you made the argument, and I have no qualms with hard-coding 4 for a `uint32_t` value, just like I routinely remove `sizeof(char)` as noise in my code. When you hard-code the size of your `long` you may end up with data corruption depending on endianness, if the app somehow managed to use a larger value than expected, and sign. Without the hard-coded size you would have to write a conversion tool, or add support for two formats. I am just kinda curious why you concluded that potential corruption (data loss) was preferable? – Allan Wind Feb 09 '23 at 00:49
  • @AndrewHenle If the type changes and the 900-page spec didn't, the worry is not "Did I miss one somewhere?", the worry is that there is a catastrophic bug *caused by the use of `sizeof`*, in conjunction with the variable that — for whatever reason — changed size. If you used sizeof, the change in the size of a variable *will* cause a bug. If you hardcode the read/write size, the change in the size of a variable *might* cause a bug, depending on endianness, and on whether the change made something larger or smaller. But it's this "will" vs. "might" that makes me prefer hardcoded sizes here. – Steve Summit Feb 09 '23 at 00:53
  • @AllanWind If by "Why did you conclude that potential corruption was preferable?" you mean, why did the legacy code I was referring to use, in effect, `long` + `sizeof`? It was legacy code, that I didn't write. (Also it was C++, so the `sizeof` was basically implicit.) The fix was of course to change it to use `int32_t` instead of `long`, but this happened only after enough "bad" data files got written that they caused lots of problems. (So, yes, we were forced to add support for two formats.) – Steve Summit Feb 09 '23 at 01:19
  • @SteveSummit Apologies if I misunderstood, but I read your comments as: it's preferable to use a fixed size which may cause corruption, as opposed to a `sizeof()` which changes the size of the file. You already answered that with "will" vs. "might" in the above. – Allan Wind Feb 09 '23 at 01:37
  • @SteveSummit *If you hardcode the read/write size, the change in the size of a variable might cause a bug* And such a bug could very well invoke undefined behavior. This is no different from using `int *p = malloc( n * sizeof( *p ));`. – Andrew Henle Feb 09 '23 at 01:50
  • @SteveSummit Disagree with you as well... First of all: in such scenarios it is absolutely essential to use the correct types in the first place. This is what the `stdint.h` header was designed for – for having data types of *exact* width in *bits*. These alias the appropriate types, e.g. `uint64_t` is `long` on 64-bit linux, but `long long` on e.g. 32-bit linux or windows. One doesn't have to care any further, apart from e.g. `printf` requiring these more complicated format strings (`"" PRI... ""`). – Aconcagua Feb 09 '23 at 05:46
  • `sizeof(var)` (or `*var` on pointers) then – *and only then* – assures the correct number of bytes to be written, whatever platform we are on: having `CHAR_BIT == 8` will print exactly 8 bytes, but a machine with `CHAR_BIT == 16` (which won't have a `uint8_t` – there's a TI DSP with a 16-bit char, by the way) will print four double-sized bytes. Then if there's a risk of the file format changing, *yet another alias*, if consistently used, fixes this issue as well: `typedef uint32_t MyIdType`. – Aconcagua Feb 09 '23 at 05:46
  • And even if you didn't use `stdint.h` this would result in one single location to change, no matter if swapping to another machine with different `char` size, swapping to OS where required underlying type is different (e.g. needing `long` vs. `long long`) or file format changing. – Aconcagua Feb 09 '23 at 05:46
  • @SteveSummit By the way: The assertion you introduced in one of your previous comments is *partially wrong*! It would have needed to be: `assert(sizeof(var) * CHAR_BIT == 32)`... I personally would have preferred a static compile time check anyway, e.g. like [here](https://godbolt.org/z/8d8nf9W4o). – Aconcagua Feb 09 '23 at 06:13
  • @SteveSummit If you have `uint32_t x; fwrite(&x, sizeof x, 1, file);` - how would the byte size enable it to write anything but exactly 4 octets as required by the file format? If you hardcode `4` and `CHAR_BIT` is `16`, it will be UB. `sizeof x` will on the other hand be `2`, so 2 (16-bit) bytes are written. – Ted Lyngmo Feb 09 '23 at 09:01
  • @Aconcagua, Ted Lyngmo: Thanks for the comments. This comment thread is clearly beyond out of hand, so I won't debate the point further here. I had an entire answer composed off-line last night, exploring the "to `sizeof` or not to `sizeof`" question in a much less contentious way (clearly I should never have said "most of the other commentators here are wrong"), but now that the question's closed, that answer remains unposted. – Steve Summit Feb 09 '23 at 12:03
  • @SteveSummit love or hate Microsoft, they made an interesting decision - the size of `long` is still 32 bits even after everybody else changed it to 64. They realized too much existing code was tied to assumptions about the size. – Mark Ransom Feb 09 '23 at 19:53
  • @SteveSummit Fair enough! Cheers! – Ted Lyngmo Feb 09 '23 at 22:17

2 Answers


"Or is there a better solution?"

Yes. At a minimum use `if (fwrite(&u32, sizeof u32, 1, file) == 1) ...`.
(Avoid magic numbers, test I/O result, ...)

Beyond that, the best solution depends on context and on coding goals not stated.

chux - Reinstate Monica

When reading and writing binary-format files, my preference is to define sets of "get" and "put" functions, like this:

/* read a 4-byte little-endian integer */
int32_t get32(FILE *fp)
{
    int32_t ret = 0;
    ret |= getc(fp);
    ret |= getc(fp) << 8;
    ret |= (uint32_t)getc(fp) << 16;
    ret |= (uint32_t)getc(fp) << 24;
    return ret;
}

/* write a 4-byte little-endian integer */
void put32(int32_t val, FILE *fp)
{
    putc( val        & 0xff, fp);
    putc((val >>  8) & 0xff, fp);
    putc((val >> 16) & 0xff, fp);
    putc((val >> 24) & 0xff, fp);
}

Then instead of your `fwrite(&u32, 4, 1, file)`, I'd call `put32(u32, file)`.

This is, admittedly, not that much more convenient than fwrite, but it gives precise control over byte order, which is the other big issue when it comes to reading/writing binary data files.

Also, getting these "get" and "put" routines absolutely right can be a little tricky. The simplified versions I've shown aren't perfect, as they play a little fast-and-loose with signed/unsigned distinctions, and the get32 I've shown doesn't have any EOF handling.

Typing in all the get's and put's can indeed be a nuisance, but in my experience it's actually a lot less of a nuisance, in the end, than some of the allegedly "more convenient" techniques people try to use. For example, rather than reading and writing one variable at a time, it's quite popular to put all your variables into a structure and then read and write whole instances of that structure — but then you get to spend many, many hours on the joyless task of trying to manage your structure padding and alignment.

If you truly have so many of these to read and write that manually maintaining the get's and put's is unworkable, I'd say it's time to look into some kind of interface generator to write the serialization and deserialization code for you, based on a nice, high-level description of your data file format and your mapping of data file fields to variable or member names in your code.

Steve Summit
  • I suggest you either make it a `uint32_t` or show us how to handle negative values. For this answer I think the former is perfectly acceptable. If it's not too much of a bother I would like to see a better signed version. For instance, `>>` is implementation-defined for negative values. Isn't it slow to do this per byte as opposed to `htobe32`/`htole32`? – Allan Wind Feb 09 '23 at 01:02
  • @chux and Allan Wind: I don't usually care about raw speed, and... – Steve Summit Feb 09 '23 at 02:58
  • @AllanWind and chux: ...I hate, hate, *hate* writing code that has to adjust itself based on which kind of platform it's running on. I *love* code like in this answer (even if it's poky and slow), because it just inherently writes little-endian, no matter what, by the laws of mathematics or something, and it's patently obvious to the reader that it's the LSB that gets read/written first. – Steve Summit Feb 09 '23 at 02:58
  • Adding error handling for each and every call to `getc()` and `putc()` here is going to result in horribly complex code. `uint32_t tmp = htonl(data);fwrite(&tmp,1,sizeof tmp, fp);` and `fread(&tmp,1,sizeof tmp, fp); uint32_t data = ntohl(tmp);`. Done, and adding robust error checking is easy. – Andrew Henle Feb 10 '23 at 00:35
  • @AndrewHenle The error checking doesn't have to be too bad, if you're willing to defer the check for a bit. See for example the code fragments in [this answer](https://stackoverflow.com/questions/75754361#75756400). – Steve Summit Mar 17 '23 at 19:37