
I would like to know the best way to work with on-disk data structures, given that the storage layout needs to match the logical design exactly. I find that structure alignment and packing do not really help much when you need a specific layout for your storage.

My approach to this problem is defining the width of the structure using a preprocessor directive and using that width when allocating the character (byte) arrays that I will write to disk after appending data that follows the logical structure model.

e.g.:

typedef struct __attribute__((packed, aligned(1))) foo {
   uint64_t some_stuff;
   uint8_t flag;
} foo;

If I persist foo on disk, the "flag" value will come at the very end of the data. Given that, I can easily use foo when reading the data back: fread into a &foo, then use the struct normally without any further byte fiddling.
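As a minimal sketch of that round trip, assuming a GCC/Clang-style packed attribute (the `roundtrip` helper and the file path below are illustrative, not part of the question):

```c
#include <stdint.h>
#include <stdio.h>

typedef struct __attribute__((packed, aligned(1))) foo {
    uint64_t some_stuff;
    uint8_t  flag;
} foo;

/* Illustrative helper: write the packed struct, then read it back. */
int roundtrip(const char *path, const foo *out, foo *in)
{
    FILE *f = fopen(path, "wb");
    if (!f) return -1;
    if (fwrite(out, sizeof *out, 1, f) != 1) { fclose(f); return -1; }
    fclose(f);

    f = fopen(path, "rb");
    if (!f) return -1;
    if (fread(in, sizeof *in, 1, f) != 1) { fclose(f); return -1; }
    fclose(f);
    return 0;
}
```

With packing, `sizeof(foo)` is 9 bytes, so the on-disk bytes follow the declaration order exactly; note, though, that the integer is written in host byte order.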

Instead, I prefer to do this:

#define foo_width sizeof(uint64_t)+sizeof(uint8_t)

uint8_t *foo = calloc(1, foo_width);

foo[0] = flag_value;
memcpy(foo+1, encode_int64(some_value), sizeof(uint64_t));

Then I just use fwrite and fread to commit and read the bytes, and later unpack them in order to use the data stored in the various logical fields.
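As a sketch of what that unpacking might look like — the `decode_u64le` and `unpack_foo` helpers are hypothetical stand-ins for the encode/decode routines mentioned above, assuming the value is stored little-endian after the flag byte:

```c
#include <stdint.h>

#define foo_width (sizeof(uint8_t) + sizeof(uint64_t))

/* Hypothetical decoder: reassemble a little-endian uint64_t from bytes,
   independent of the host byte order. */
static uint64_t decode_u64le(const uint8_t *p)
{
    uint64_t v = 0;
    for (int i = 0; i < 8; i++)
        v |= (uint64_t)p[i] << (8 * i);
    return v;
}

/* Unpack a record laid out as [flag][8-byte little-endian value]. */
static void unpack_foo(const uint8_t *rec, uint8_t *flag, uint64_t *value)
{
    *flag = rec[0];
    *value = decode_u64le(rec + 1);
}
```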

I wonder which approach is best to use, given that I want the layout of the on-disk storage to match the logical layout ... this was just an example ...

If anyone knows how efficient each method is with respect to decoding/unpacking bytes versus copying the structure directly from its on-disk representation, please share. I personally prefer the second approach, since it gives me full control over the storage layout, but I am not willing to sacrifice performance for looks, since this approach requires a lot of loop logic to unpack and traverse through the bytes to the various boundaries in the data.

Thanks.

DeLorean
What does `decode_int64` do? Are you using a string function for binary data? If you do, think about what will happen if one of the bytes in the binary value is zero. – Some programmer dude Nov 19 '14 at 11:44
  • 3
    And why don't you just write/read the structures directly? Then it will work even with padding and proper alignment (unless you plan to move the data between different platforms, or even between programs using different compilers, then you're better off with a serialized text-based data format). – Some programmer dude Nov 19 '14 at 11:45
I changed it to encode_int64, sorry, that was a typo. Basically it's for encoding the 64-bit integer into a byte array with respect to endianness, since I am not using a struct to do this for me naturally. On the other question: I just need to match the logical layout of the store to the physical layout on disk. A struct is limited since the order of the structure elements is restricted to the ordering of the bits represented by each type. There is no way a uint8_t type can come before the uint64_t while maintaining packing and alignment in the example I gave. – DeLorean Nov 19 '14 at 11:49
  • Still, using a string function is not correct when copying binary data. Use `memcpy` instead. – Some programmer dude Nov 19 '14 at 11:53
I changed that too, thanks, but it's just a simple example to add context to the question. I wasn't actually going to use the code this way :) – DeLorean Nov 19 '14 at 11:58
  • 1
    OT: Use parens for macro definitions: `#define foo_width (sizeof(uint64_t)+sizeof(uint8_t))` or things like `2 * foo_width` have funny results. – mafso Nov 19 '14 at 13:00
  • 1
    Rob Pike's article [The Byte Order Fallacy](http://commandcenter.blogspot.com/2012/04/byte-order-fallacy.html) is worth reading. Manually converting bytes into C datatypes is conceptually superior to and not significantly more expensive than blitting strategies. CPUs are much faster than disks or memories. – NovaDenizen Apr 16 '15 at 17:36
If there are no serious reasons, use JSON/XML. If there are serious reasons, add magic numbers, add a memory-layout version, normalize endianness (man endian.h, arpa/inet.h), and double-check the sanity of the values. – rralf Apr 24 '15 at 08:31
  • @NovaDenizen: turn that comment into an answer, IMO. Very nice article about how to serialize / deserialize data with endian-agnostic code. – Peter Cordes Jul 05 '15 at 23:23

3 Answers


Based on your requirements (considering looks and performance), the first approach is better because the compiler will do the hard work for you. In other words, if a tool (the compiler, in this case) provides a certain feature, you do not want to implement it on your own because, in most cases, the tool's implementation will be more efficient than yours.

RcnRcf
Not necessarily true; efficiency comes from how well the programmer uses the tools, and this means the programmer needs to test and evaluate one technique against another .. just like what I am doing at the moment in order to have efficient code :) – DeLorean Nov 20 '14 at 15:51
That's why I said **in most cases**. I have done such an evaluation in the past and knew that when I ported my code to another toolchain I would have to re-evaluate the tool versus the customized implementation. You will have to do the same. Compare the assembly code the compiler generates for the two methods and choose the one that is optimal! – RcnRcf Nov 21 '14 at 18:37

I prefer something close to your second approach, but without memcpy:

void store_i64le(void *dest, uint64_t value)
{  // Generic version which will work with any platform
  uint8_t *d = dest;
  d[0] = (uint8_t)(value);
  d[1] = (uint8_t)(value >> 8);
  d[2] = (uint8_t)(value >> 16);
  d[3] = (uint8_t)(value >> 24);
  d[4] = (uint8_t)(value >> 32);
  d[5] = (uint8_t)(value >> 40);
  d[6] = (uint8_t)(value >> 48);
  d[7] = (uint8_t)(value >> 56);
}

store_i64le(foo+1, some_value);

On a typical ARM, the above store_i64le method would translate into about 30 bytes--a reasonable tradeoff of time, space, and complexity. Not quite optimal from a speed perspective, but not much worse than optimal from a space perspective on something like the Cortex-M0 which doesn't support unaligned writes. Note that the code as written has zero dependence upon machine byte order. If one knew that one was using a little-endian platform whose hardware would convert unaligned 32-bit accesses to a sequence of 8- and 16-bit accesses, one could rewrite the method as

void store_i64le(void *dest, uint64_t value)
{  // For an x86 or little-endian ARM which can handle unaligned 32-bit loads and stores
  uint32_t *d = dest;
  d[0] = (uint32_t)(value);
  d[1] = (uint32_t)(value >> 32);
}

which would be faster on the platforms where it would work. Note that the method would be invoked the same way as the byte-at-a-time version; the caller wouldn't have to worry about which approach is in use.
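For completeness, a sketch of the matching load (not part of the original answer): the byte-at-a-time counterpart that reassembles the value, again with no dependence on host byte order:

```c
#include <stdint.h>

/* Hypothetical counterpart to store_i64le: rebuild a uint64_t from
   little-endian bytes, portable to any byte order. */
uint64_t load_i64le(const void *src)
{
    const uint8_t *s = src;
    return  (uint64_t)s[0]
          | (uint64_t)s[1] << 8
          | (uint64_t)s[2] << 16
          | (uint64_t)s[3] << 24
          | (uint64_t)s[4] << 32
          | (uint64_t)s[5] << 40
          | (uint64_t)s[6] << 48
          | (uint64_t)s[7] << 56;
}
```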

supercat
Thanks for sharing this. I would love it if you could please share more on how well suited this type of approach is compared to using a C structure. Have you discovered any pitfalls in using this method, or maybe a few advantages that you could share ... performance-wise? – DeLorean Nov 20 '14 at 15:48
@DeLorean: Coding efficiency will often depend on the extent to which one is willing to optimize for a particular architecture. This approach has the advantage of centralizing all the architecture-specific aspects. Using C structures will work if the architecture is known, but may provide no practical migration path to architectures with different requirements (e.g. little-endian versus big-endian). – supercat Nov 20 '14 at 16:13

If you are on Linux or Windows, then just memory-map the file and cast the pointer to the type of the C struct. Whatever you write in this mapped area will be automatically flushed to disk in the most efficient way the OS has available. It will be a lot more efficient than calling "write", with minimal hassle for you.
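A minimal sketch of this approach on Linux, assuming the packed struct from the question; the `map_foo` helper, flags, and file handling below are illustrative assumptions, not a definitive implementation:

```c
#include <fcntl.h>
#include <stdint.h>
#include <sys/mman.h>
#include <unistd.h>

typedef struct __attribute__((packed, aligned(1))) foo {
    uint64_t some_stuff;
    uint8_t  flag;
} foo;

/* Illustrative helper: map the file and hand back a pointer usable
   as a struct. Creates and sizes the file if needed. */
foo *map_foo(const char *path)
{
    int fd = open(path, O_RDWR | O_CREAT, 0644);
    if (fd < 0) return NULL;
    if (ftruncate(fd, sizeof(foo)) < 0) { close(fd); return NULL; }
    void *p = mmap(NULL, sizeof(foo), PROT_READ | PROT_WRITE,
                   MAP_SHARED, fd, 0);
    close(fd);  /* the mapping keeps the file open */
    return p == MAP_FAILED ? NULL : p;
}
```

Stores through the returned pointer are flushed by the OS; `msync()` forces an explicit flush and `munmap()` releases the mapping.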

As others have mentioned, it isn't very portable. To be portable between little-endian and big-endian platforms, the common strategy is to write the whole file in one fixed byte order and convert as you access it. However, this throws away your speed. A way to preserve your speed is to write an external utility that converts the whole file once, and then run that utility any time you move the data from one platform to another.
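A sketch of the convert-on-access strategy, assuming glibc/BSD `endian.h` (the `htobe64`/`be64toh` macros are nonstandard extensions, and the accessor names here are illustrative):

```c
#define _DEFAULT_SOURCE   /* expose htobe64/be64toh on glibc */
#include <endian.h>
#include <stdint.h>

/* Keep the field big-endian in the file image; convert at each access. */
static uint64_t get_some_stuff(const uint64_t *stored)
{
    return be64toh(*stored);
}

static void set_some_stuff(uint64_t *stored, uint64_t v)
{
    *stored = htobe64(v);
}
```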

In the case that you have two different platforms accessing a single file over a shared network path, you are in for a lot of pain if you try writing it yourself just because of synchronization issues, so I would suggest an entirely different approach like using sqlite.

dataless