I'm parsing a custom binary file structure for which I know the format.

The general idea is that each file is broken up into blocks of sequential bytes, which I want to separate and decode in parallel.

I'm looking for a readable, performant alternative to decode_block().

Here's what I'm currently working with:

#include <stdio.h>
#include <stdint.h>

int decode_block(uint8_t buffer[]);

int main(){
  FILE *ptr;

  ptr = fopen("example_file.bin", "rb");
  if (!ptr){
    printf("can't open.\n");
    return 1;
  }

  int block1_size = 2404;
  uint8_t block1_buffer[block1_size];
  fread(block1_buffer, sizeof(char), block1_size, ptr);

  int block2_size = 3422;
  uint8_t block2_buffer[block2_size];
  fread(block2_buffer, sizeof(char), block2_size, ptr);

  fclose(ptr);

  //Do these in parallel
  decode_block(block1_buffer);
  decode_block(block2_buffer);
  return 0;
}

int decode_block(uint8_t buffer[]){
  unsigned int size_of_block = (buffer[3] << 24) + (buffer[2] << 16) + (buffer[1] << 8) + buffer[0];
  unsigned int format_version = buffer[4];
  unsigned int number_of_values = (buffer[8] << 24) + (buffer[7] << 16) + (buffer[6] << 8) + buffer[5];
  unsigned int first_value = (buffer[10] << 8) + buffer[9];

  // On and on and on

  int ptr = first_value;
  int values[number_of_values];

  for(unsigned int i = 0; i < number_of_values; i++){
    values[i] = (buffer[ptr + 3] << 24) + (buffer[ptr + 2] << 16) + (buffer[ptr + 1] << 8) + buffer[ptr];
    ptr += 4;
  }

  // On and on and on

  return 0;
}

It feels a little redundant to be reading the entire file into a byte array and then interpreting the array byte by byte. Also it makes for very bulky code.

But since I need to operate on multiple parts of the file in parallel, I can't think of another way to do this. Also, is there a simpler or faster way to convert the early bytes in buffer into their respective metadata values?

CSStudent7782
  • Do you know where the block boundaries are in the file, without having to scan it for them? – jwdonahue Jun 05 '20 at 19:04
  • Open the file twice - 2 file pointers. Have 1 thread start reading/decoding from the beginning and another thread start reading/decoding in the middle. See: fseek. – 001 Jun 05 '20 at 19:04
  • @jwdonahue Yes, I know the block boundaries ahead of time. – CSStudent7782 Jun 05 '20 at 19:06
  • @Johnny Mopp is there no problem with file permissions to be reading the same file with two different pointers at the same time? – CSStudent7782 Jun 05 '20 at 19:06
  • [C open file multiple times](//stackoverflow.com/a/15620747) – 001 Jun 05 '20 at 19:08
  • @Johnny Mopp oh wow, that's definitely 1 big upgrade there, thank you. – CSStudent7782 Jun 05 '20 at 19:21
  • Ya just spin up a thread per block, each one seeks to the appropriate offset and reads its block of data. It gets a little tricky if any of them have to write back to the file, particularly if it changes the size of the block. – jwdonahue Jun 05 '20 at 19:47
  • Mind you, too many threads in the same file, even if they are only reading, can drive performance down, especially if file system read-ahead buffering is exceeded by file size or the work to be performed on each block is trivial compared to the block seek & read times. Another approach is to setup a thread pool to process blocks off of a queue and have one thread read in the blocks and queue them up. Then you have processing as soon as the first block is available and you use more cores for processing. – jwdonahue Jun 05 '20 at 19:54
  • You're probably better off with `read` (vs. `fread`) if you're going to do multiple threads. Using `pread` is even better. But, the _really_ fast way is to `mmap` the file, and pass offsets/lengths within the mapped area to the various threads. Thus, no reading required – Craig Estey Jun 05 '20 at 19:56
  • @CraigEstey Thank you for the 4 options. mmap sounds really cool, I'm going to check it out! – CSStudent7782 Jun 05 '20 at 21:20
  • The reason I say `mmap` is faster because I've done some benchmarks: https://stackoverflow.com/questions/33616284/read-line-by-line-in-the-most-efficient-way-platform-specific/33620968#33620968 And, [more] recently, I did a multithreaded version: https://stackoverflow.com/questions/60779978/memory-leak-how-do-i-allocate-memory-for-a-typdef-struct-passed-within-another/60780421#60780421 – Craig Estey Jun 05 '20 at 23:38

1 Answer


I'd:

  • use "memory mapped files" to avoid loading the raw data (e.g. mmap() in POSIX systems). Note that this is not portable "plain C", but almost every OS supports a way to do this.

  • make sure that the file format specification requires that the values are aligned to a 4-byte boundary in the file and (if you actually do need to support signed integers) that the values are stored in "two's complement" format (and not "sign and magnitude" or anything else)

  • check that the file complies with the specification as much as possible (not just the alignment requirement, but including things like "data can't start in middle of header", "data start + entries * entry_size can't exceed file's size", "version not recognized", etc).

  • have different code for little-endian machines (e.g. which code is used may be selected at compile time with an #ifdef), where you can cast the memory mapped file's data to int32_t (or uint32_t). Note that the code you've shown (e.g. (buffer[ptr + 3] << 24) + (buffer[ptr + 2] << 16) + (buffer[ptr + 1] << 8) + buffer[ptr]) is broken for negative numbers (even on "two's complement" machines); so the alternative code (for the "not little-endian" case) will be more complicated (and slower) than yours. Of course, if you don't need to support negative numbers you shouldn't be using any signed integer type (e.g. int), and quite frankly you shouldn't be using a "possibly 16-bit" int for 32-bit values anyway.

  • determine how many threads you should use (maybe command line argument; maybe by asking OS how many CPUs the computer actually has). Start the threads and tell them which "thread number" they are (where existing thread is number 0, first spawned thread is number 1, etc).

  • let the threads calculate their starting and ending offset (in the memory mapped file) from their "thread number", a global "total threads", a global "total entries" and a global "offset of first entry". This is mostly just division with special care for rounding. Note that (to avoid global variables) you could pass a structure containing the details to each thread instead. No safeguards (e.g. locks, critical sections) will be needed for this data because threads only read it.

  • let each thread parse its section of the data in parallel; then wait for them all to finish (e.g. maybe "thread number 0" does "pthread_join()" if you don't want to keep the threads for later use).

You will probably also need to check that all values (parsed by all threads) are within an allowed range (to comply with the file format specification); and have some kind of error handling for when they don't (e.g. when the file is corrupt or has been maliciously tampered with). This could be as simple as a (global, atomically incremented) "number of dodgy values found so far" counter; which could allow you to display an "N dodgy values found" error message after all values are parsed.

Note 1: If you don't want to use a memory mapped file (or can't); you can have one "file reader thread" and multiple "file parser threads". This takes a lot more synchronization (it devolves into a FIFO queue with flow control - e.g. with provider thread doing some kind of "while queue full { wait }" and consumer threads doing some kind of "while queue empty { wait }"). This extra synchronization will increase overhead and make it slower (in addition to being more complex), compared to using memory mapped files.

Note 2: If the file's data isn't cached by the operating system's file data cache, then you'll probably be bottlenecked by file IO regardless of what you do and using multiple threads probably won't help performance for that case.

Brendan
  • This is a fantastic answer, thank you very much! I'm not sure what you mean by checking that "data can't start in middle of header". What would this mean and how could I check for it? – CSStudent7782 Jun 05 '20 at 22:08
  • @CSStudent7782: I took a guess that `first_value = (buffer[10] << 8) + buffer[9];` was used to determine the offset of the first value in the file (e.g. so that you can increase the size of the header in future without breaking backward compatibility). If my guess is right, then you might have a check like `if(first_value < 11)` (because the header was 11 bytes for that version of it). – Brendan Jun 06 '20 at 10:50