52

I have a binary file with a layout I know. For example, let the format be like this:

  • 2 bytes (unsigned short) - length of a string
  • 5 bytes (5 x chars) - the string - some id name
  • 4 bytes (unsigned int) - a stride
  • 24 bytes (6 x float - 2 strides of 3 floats each) - float data

The file should look like this (I added spaces for readability):

5 hello 3 0.0 0.1 0.2 -0.3 -0.4 -0.5

Here 5 is 2 bytes: 0x05 0x00, "hello" is 5 bytes, and so on.

Now I want to read this file. Currently I do it like this:

  • load the file into an ifstream
  • read from this stream into a char buffer[2]
  • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have the length of the string.
  • read the stream into a vector<char> and create a std::string from this vector. Now I have the string id.
  • read the next 4 bytes the same way and cast them to unsigned int. Now I have the stride.
  • while not end of file, read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.

This works, but to me it looks ugly. Can I read directly into an unsigned short, float, string, etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I'm using is old-fashioned)?

P.S.: while I was writing the question, a clearer formulation came to mind - how do I cast an arbitrary number of bytes from an arbitrary position in a char[x]?

Update: I forgot to mention explicitly that the string and float data lengths are not known at compile time and are variable.

nikitablack
  • Modern? I'd recommend [Boost](http://www.boost.org/).[Spirit](http://www.boost.org/libs/spirit/).[Qi](http://www.boost.org/libs/spirit/doc/html/spirit/qi/tutorials/quick_start.html), or at least avoid any solution that involves casting/aliasing. – ildjarn Nov 10 '14 at 14:04
  • @ildjarn Oh, sorry, forgot to mention - pure c++ without libraries. – nikitablack Nov 10 '14 at 14:05
  • Why on earth would you want to reinvent the (in this case very non-trivial) wheel? – ildjarn Nov 10 '14 at 14:07
  • @ildjarn Let's say - for learning purposes. Actually it's not complicated - my code is simple and workable. I just have a 6th sense that there's a better **native** way. – nikitablack Nov 10 '14 at 14:09
  • Why include the length of the string if it is hardset to 5 characters? – Neil Kirk Nov 10 '14 at 14:09
  • Your code fails the alignment and aliasing tests as well – there's more to this than is immediately apparent. ;-] – ildjarn Nov 10 '14 at 14:10
  • If the file in text looks like this: `5hello`, then `5` takes 1 byte, not 2 – Neil Kirk Nov 10 '14 at 14:11
  • @Neil Kirk String is not fixed, why do you think so? And 5 is decoded as 2 bytes exactly - that's approved layout format. – nikitablack Nov 10 '14 at 14:14
  • It says in your spec `5 bytes (5 x chars) - the string - some id name` It doesn't say size depends on previous parameter. – Neil Kirk Nov 10 '14 at 14:15
  • Reading two bytes from that file will return "`5h`" – Neil Kirk Nov 10 '14 at 14:15
  • @Neil Kirk Sorry if it wasn't clear. And the length of the string is encoded as 2 bytes. Btw, what the "5h" is? – nikitablack Nov 10 '14 at 14:22
  • If file contains "5h" and you read two bytes, you get "5h" not "5" – Neil Kirk Nov 10 '14 at 14:22
  • @NeilKirk Dude he clearly means the bytes are 0x05 0x00 0x68 0x65 0x6c 0x6c ... – Barry Nov 10 '14 at 14:22
  • "The file should look like (I added spaces for readability):" So I assumed removing the spaces would give the original file. – Neil Kirk Nov 10 '14 at 14:23
  • @Barry Yes! You understood me correctly. – nikitablack Nov 10 '14 at 14:25
  • @Barry "5hello" is a valid binary file! In context of I/O, binary just means there is no special newline jiggery-pokery. – Neil Kirk Nov 10 '14 at 14:33
  • There is no parsing to be done here. You don't need any parsing, you need *serialization*. You also either don't understand what a binary file looks like, or for some reason misrepresent it in your example. A binary file that stores the number 5 as a two-byte integer will not store it as a character '5' followed by a space. – n. m. could be an AI Nov 10 '14 at 14:36
  • C'mon guys. I wrote that the first 2 bytes is a length of a string. If I'd write 0x050x00 is it better? – nikitablack Nov 10 '14 at 14:45
  • A binary `float` doesn't have variable length, it has the length of 4 known at compile time. In short you don't have parsing and you don't have binary, you have serialization to/from a text file. – n. m. could be an AI Nov 10 '14 at 14:45
  • Yes it would be better, with a space between hex values. Also now you need to consider endianness. – Neil Kirk Nov 10 '14 at 14:46
  • If you mean that, then yes, it is better to write that. – n. m. could be an AI Nov 10 '14 at 15:36
  • a modern way? using c++? – njzk2 Nov 10 '14 at 18:48
  • The [kaitai](https://kaitai.io/) library is also very helpful. – Alexander Cai Jun 22 '21 at 19:43

11 Answers

14

If it is not for learning purposes, and if you have freedom in choosing the binary format, consider using something like protobuf, which will handle the serialization for you and allow interoperation with other platforms and languages.

If you cannot use a third-party API, you may look at QDataStream for inspiration.
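For illustration, the question's record could be described with a schema along these lines (field names are assumptions; protobuf then generates the C++ serialization code from it):

```protobuf
syntax = "proto3";

// Hypothetical schema for the record in the question.
message Record {
  string id = 1;          // replaces the explicit length + chars
  uint32 stride = 2;
  repeated float data = 3; // packed encoding by default in proto3
}
```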

fjardon
  • protobuf is a **serialization** technology, and does a very poor job as a file format. There are much more appropriate tools for that, such as JSON, XML or SQLite. – slaphappy Nov 10 '14 at 15:03
  • @Mr.kbok In my opinion, every binary format is a bad format. I think that's why 2 out of 3 of your alternatives are **text** file formats. The best reasons for using a binary format are compactness and reading/writing speed. protobuf perfectly fulfills these 2 goals. It also adds portability and versioning. – fjardon Nov 10 '14 at 15:18
  • Not at all. I used text formats as examples because they are easy to use for new programmers, but there are plenty of excellent binary formats out there (think OLAP, media files, etc.). Protobuf is hard to use correctly, and, as a streaming format, requires you to go through your whole file to find some specific information. In this regard, this is a terrible format for a file. – slaphappy Nov 10 '14 at 15:23
  • @Mr.kbok `protobuf` has a key feature other binary formats do not have: customizability. You cannot stuff arbitrary data arbitrarily structured into a JPEG or MP4. – Siyuan Ren Nov 11 '14 at 01:43
  • @Mr.kbok: there is no sense in speaking of *laziness* for the format `protobuf`, because implementations are explicitly allowed to be both lazy and non-lazy; see [https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor.pb](https://developers.google.com/protocol-buffers/docs/reference/cpp/google.protobuf.descriptor.pb) starting at "Should this field be parsed lazily?". Google's reference C++ implementation is lazy as far as I recall. – Matthieu M. Nov 12 '14 at 18:33
11

The C way, which would work fine in C++, would be to declare a struct:

#pragma pack(1)

struct contents {
   // data members;
};

Note that

  • You need to use a pragma to make the compiler lay the data out exactly as it looks in the struct;
  • This technique only works with POD types.

And then cast the read buffer directly into the struct type:

std::vector<char> buf(sizeof(contents));
file.read(buf.data(), buf.size());
contents *stuff = reinterpret_cast<contents *>(buf.data());

Now if your data's size is variable, you can split the reads into several chunks. To read a single binary object from the buffer, a reader function comes in handy:

template<typename T>
const char *read_object(const char *buffer, T& target) {
    target = *reinterpret_cast<const T*>(buffer);
    return buffer + sizeof(T);
}

The main advantage is that such a reader can be overloaded for more advanced C++ objects:

template<typename CT>
const char *read_object(const char *buffer, std::vector<CT>& target) {
    size_t size = target.size();
    CT const *buf_start = reinterpret_cast<const CT*>(buffer);
    std::copy(buf_start, buf_start + size, target.begin());
    return buffer + size * sizeof(CT);
}

And now in your main parser:

int n_floats;
iter = read_object(iter, n_floats);
std::vector<float> my_floats(n_floats);
iter = read_object(iter, my_floats);

Note: As Tony D observed, even if you can get the alignment right via #pragma directives and manual padding (if needed), you may still encounter incompatibility with your processor's alignment, in the form of (best case) performance issues or (worst case) trap signals. This method is probably interesting only if you have control over the file's format.

slaphappy
  • This fails to align the data properly. – ildjarn Nov 10 '14 at 14:06
  • -1, this is a very bad idea. Structures can (and very often will) have invisible padding bytes added for alignment, which the file won't have. – unwind Nov 10 '14 at 14:06
  • Alignment is corrected via pragmas. This doesn't change the nature of the technique. – slaphappy Nov 10 '14 at 14:11
  • Pragmas are not portable. – ildjarn Nov 10 '14 at 14:11
  • @Mr. kbok Sorry. can't get how it should work - in struct I have to define a vectors, but what length should they be? – nikitablack Nov 10 '14 at 14:12
  • This will not work if the data structure contains non-POD types. – Captain Obvlious Nov 10 '14 at 14:12
  • Reading binary files isn't either; however, pragmas are now sufficiently common to assume they are available on OP's compiler. – slaphappy Nov 10 '14 at 14:13
  • You can normally correct alignment and padding of the structure using pragmas, but 1) your current code doesn't know the alignment of data at `buf.data()`, so on some CPUs you'll still get SIGBUS or similar, or reduced performance for misaligned reads, when trying to extract the data via `stuff`; and further, the binary file itself may not have data at offsets that can be aligned simply by reading the data in at a specific alignment. For example, if there are two 32-bit floats with a char between them, then any approach using a `struct` and wholesale binary read has potential issues. – Tony Delroy Nov 10 '14 at 14:15
  • @nikitablack Sorry, forgot to specify that (as Captain Obvlious said) it only works for POD types: floats, ints, chars, etc. – slaphappy Nov 10 '14 at 14:16
  • +1, although everyone always downvotes this idea, I happily used it in production in really complex cases with no portability issues; you just need to do things carefully. But in my case I skip C++ I/O entirely and use standard C I/O. – Jack Nov 10 '14 at 14:18
  • `contents` is a packed struct, so if you're just reading POD types (which OP seems to be!), how is this not the right answer? +1. – Barry Nov 10 '14 at 14:20
  • Oh nvm, it's string length followed by N bytes, so it's not a static type. But still, this answer deserves way better than -3... sheesh. – Barry Nov 10 '14 at 14:21
  • Yeah - the variable length string means the size and string need to be read another way, then a struct could be used to parse the remaining fixed-layout content. – Tony Delroy Nov 10 '14 at 14:24
  • @Barry: Yes, it's quite strange since I thought this was a very common technique in C. – slaphappy Nov 10 '14 at 14:24
  • Actually in my example the string is variable and float data is variable too. – nikitablack Nov 10 '14 at 14:29
  • @nikitablack The technique can be extended. I edited my answer. – slaphappy Nov 10 '14 at 14:37
  • @Mr. kbok I'm trying your approach, but I have compilation fails. Can you please explain how things should work? I'm using VS2013. If I insert your code as is I get error _C2100: illegal indirection_ on line `*target = reinterpret_cast(*buffer);`. When I'm trying specialization template I get _error C2910: 'read_object' : cannot be explicitly specialized_ – nikitablack Nov 10 '14 at 16:50
  • @nikitablack: It's `(buffer)` instead of `(*buffer)`. Regarding the error 2910, it was an issue in the code. I fixed it. – slaphappy Nov 10 '14 at 16:57
  • @nikitablack: You're welcome. Your question started quite a debate :) – slaphappy Nov 10 '14 at 17:10
  • The `reinterpret_cast` will lead to undefined behavior when you access the data due to breaking the aliasing rules. If you want to do this then it's probably better to create a variable of your struct type and use it as your buffer, or `memcpy` from your buffer into the struct. In all cases you should validate the data to avoid security vulnerabilities from malformed input. – bames53 Nov 11 '14 at 17:53
  • @bames53 Actually, if a type is trivially copyable, it can be converted to/from an array of char and is guaranteed to retain its value. – Red Alert Nov 11 '14 at 21:29
  • @RedAlert But it's still not legal to access a char array through a pointer to some other type. – bames53 Nov 11 '14 at 22:30
  • @bames53 yes, that's what I've read. Just curious, what is the difference between reading the char array through a ptr to a different type, and reading a memcpy of the char array to a ptr to that different type? Does memcpy do anything beyond a dumb copy of every single byte? – Red Alert Nov 11 '14 at 23:34
  • @RedAlert the difference is that the compiler isn't required to generate working code for aliasing violations. The most straightforward example of things going wrong is that some hardware may produce a bus error when an unaligned object is accessed. The optimizer can also generate code that assumes the aliasing violation never happens, and so produce bizarre results if it does. – bames53 Nov 12 '14 at 03:17
11

Currently I do it like this:

  • load the file into an ifstream

  • read from this stream into a char buffer[2]

  • cast it to unsigned short: unsigned short len{ *((unsigned short*)buffer) };. Now I have the length of the string.

That last step risks a SIGBUS (if your character array happens to start at an odd address and your CPU can only read 16-bit values aligned at an even address), performance problems (some CPUs read misaligned values, but slowly; others, like modern x86s, are fine and fast) and/or endianness issues. I'd suggest reading the two characters; then you can say (x[0] << 8) | x[1] or vice versa, using htons if you need to correct for endianness.

  • read the stream into a vector<char> and create a std::string from this vector. Now I have the string id.

No need... just read directly into the string:

std::string s(the_size, ' ');

if (input_stream.read(&s[0], s.size()) &&
    input_stream.gcount() == s.size())
    ...use s...
  • the same way read the next 4 bytes and cast them to unsigned int. Now I have a stride. While not end of file, read floats the same way - create a char bufferFloat[4] and cast *((float*)bufferFloat) for every float.

Better to read the data directly into the unsigned ints and floats, as that way the compiler ensures correct alignment.

This works, but to me it looks ugly. Can I read directly into an unsigned short, float, string, etc. without creating a char[x]? If not, what is the correct way to cast (I have read that the style I'm using is old-fashioned)?

struct Data
{
    uint32_t x;
    float y[6];
};
Data data;
if (input_stream.read((char*)&data, sizeof data) &&
    input_stream.gcount() == sizeof data)
    ...use x and y...

Note the code above avoids reading data into potentially unaligned character arrays; it's unsafe to reinterpret_cast data in a potentially unaligned char array (including inside a std::string) because of alignment issues. Again, you may need some post-read conversion with htonl if there's a chance the file content differs in endianness. If there's an unknown number of floats, you'll need to calculate and allocate sufficient storage with alignment of at least 4 bytes, then aim a Data* at it; it's legal to index past the declared array size of y as long as the memory content at the accessed addresses was part of the allocation and holds a valid float representation read in from the stream. Simpler - but with an additional read, so possibly slower - is to read the uint32_t first, then new float[n] and do a further read into there.

Practically, this type of approach can work and a lot of low level and C code does exactly this. "Cleaner" high-level libraries that might help you read the file must ultimately be doing something similar internally....

Tony Delroy
  • You won't be able to read into `std::string` like that, because `.data()` returns `const char*`, and `.read()` needs `char *`. Also it's probably `UB`. – Nazar554 Nov 10 '14 at 19:08
  • @Nazar554 : Correct, but `input_fstream.read(&s[0], s.size());` is legal in C++11/C++14. – ildjarn Nov 10 '14 at 20:14
8

I actually implemented a quick and dirty binary format parser to read .zip files (following Wikipedia's format description) just last month, and being modern I decided to use C++ templates.

On some specific platforms, a packed struct could work, however there are things it does not handle well... such as fields of variable length. With templates, however, there is no such issue: you can get arbitrarily complex structures (and return types).

A .zip archive is relatively simple, fortunately, so I implemented something simple. Off the top of my head:

using Buffer = std::pair<unsigned char const*, size_t>;

template <typename OffsetReader>
class UInt16LEReader: private OffsetReader {
public:
    UInt16LEReader() {}
    explicit UInt16LEReader(OffsetReader const reader): OffsetReader(reader) {}

    uint16_t read(Buffer const& buffer) const {
        OffsetReader const& reader = *this;

        size_t const offset = reader.read(buffer);
        assert(offset <= buffer.second && "Incorrect offset");
        assert(offset + 2 <= buffer.second && "Too short buffer");

        unsigned char const* begin = buffer.first + offset;

        // http://commandcenter.blogspot.fr/2012/04/byte-order-fallacy.html
        return (uint16_t(begin[0]) << 0)
             + (uint16_t(begin[1]) << 8);
    }
}; // class UInt16LEReader

// Declined for UInt[8|16|32][LE|BE]...

Of course, the basic OffsetReader actually has a constant result:

template <size_t O>
class FixedOffsetReader {
public:
    size_t read(Buffer const&) const { return O; }
}; // class FixedOffsetReader

and since we are talking templates, you can switch the types at leisure (you could implement a proxy reader which delegates all reads to a shared_ptr which memoizes them).

What is interesting, though, is the end-result:

// http://en.wikipedia.org/wiki/Zip_%28file_format%29#File_headers
class LocalFileHeader {
public:
    template <size_t O>
    using UInt32 = UInt32LEReader<FixedOffsetReader<O>>;
    template <size_t O>
    using UInt16 = UInt16LEReader<FixedOffsetReader<O>>;

    UInt32< 0> signature;
    UInt16< 4> versionNeededToExtract;
    UInt16< 6> generalPurposeBitFlag;
    UInt16< 8> compressionMethod;
    UInt16<10> fileLastModificationTime;
    UInt16<12> fileLastModificationDate;
    UInt32<14> crc32;
    UInt32<18> compressedSize;
    UInt32<22> uncompressedSize;

    using FileNameLength = UInt16<26>;
    using ExtraFieldLength = UInt16<28>;

    using FileName = StringReader<FixedOffsetReader<30>, FileNameLength>;

    using ExtraField = StringReader<
        CombinedAdd<FixedOffsetReader<30>, FileNameLength>,
        ExtraFieldLength
    >;

    FileName filename;
    ExtraField extraField;
}; // class LocalFileHeader

This is rather simplistic, obviously, but incredibly flexible at the same time.

An obvious axis of improvement would be to improve chaining since here there is a risk of accidental overlaps. My archive reading code worked the first time I tried it though, which was evidence enough for me that this code was sufficient for the task at hand.

Matthieu M.
  • This is in my opinion the most adequate answer! The question asked for modern C++. It is not modern to be compiler-dependent. – Jonas Wolf Jul 31 '16 at 09:28
4

I had to solve this problem once. The data files were packed FORTRAN output. Alignments were all wrong. I succeeded with preprocessor tricks that did automatically what you are doing manually: unpack the raw data from a byte buffer to a struct. The idea is to describe the data in an include file:

BEGIN_STRUCT(foo)
    UNSIGNED_SHORT(length)
    STRING_FIELD(length, label)
    UNSIGNED_INT(stride)
    FLOAT_ARRAY(3 * stride)
END_STRUCT(foo)

Now you can define these macros to generate the code you need, say the struct declaration, include the above, undef and define the macros again to generate unpacking functions, followed by another include, etc.

NB I first saw this technique used in gcc for abstract syntax tree-related code generation.

If CPP is not powerful enough (or such preprocessor abuse is not for you), substitute a small lex/yacc program (or pick your favorite tool).

It's amazing to me how often it pays to think in terms of generating code rather than writing it by hand, at least in low level foundation code like this.

Gene
  • Sorry, I forgot to mention explicitly that string and float array is not known at compile time. – nikitablack Nov 10 '14 at 14:32
  • Having seen this code in production, I don't think this is good advice. This is very difficult to understand and troubleshoot/debug. – slaphappy Nov 10 '14 at 14:44
  • @nikitablack It works fine with dynamic length fields. I can't show you on my mobile. – Gene Nov 10 '14 at 19:07
  • @Mr. kbok He used this technique in the code for gcc. – Gene Nov 10 '14 at 20:48
  • So Richard Stallman used this technique in the late 80's, on software reputed for its unmaintainability, and this is supposed to be a good, modern C++ way of doing this? – slaphappy Nov 11 '14 at 11:58
  • We did not find this method at all difficult to maintain in a system with about 80 struct types to pack and unpack. I don't think Richard's coding choice has anything to do with the maintainability issues of gcc. As I said, if you don't like the C preprocessor method, then write your own translator. Thousands of lines of repetitious byte-mangling code is poor practice. – Gene Nov 11 '14 at 16:58
  • @Gene: while in C I would understand this strategy, in C++ I advise templates instead. – Matthieu M. Nov 12 '14 at 18:36
  • (for any at least moderately complex task) I rather code and maintain a code generator in some higher level programming language and drive it with a nice syntactic-noiseless metadata(DSL) than a C++ template system (+macros) with all of its boilerplate nonsense... then there are other people who want to code everything in their language because they *can* (even though it might not be the right tool). I don't blame them, I guess it's a matter of taste... – Karoly Horvath Nov 13 '14 at 20:10
3

You would do better to declare a structure (with 1-byte packing - how depends on the compiler). Write using that structure, and read using the same structure. Put only POD types in the structure, hence no std::string etc. Use this structure only for file I/O or other inter-process communication; use a normal struct or class to hold the data for further use in the C++ program.

Ajay
  • But how can I declare a structure if I don't know the length of data? It can be arbitrary. – nikitablack Nov 10 '14 at 14:17
  • I assume you need to store records of same data. If dissimilar collection is to be stored, you need to put flag for that also. Let say flag (value) `1` for `BigData` and `2` for `HugeData`. When reading, parse the flag value, and use appropriate struct. – Ajay Nov 10 '14 at 14:19
  • Oh, I see, but in my case it's not suitable - I have 100500 such data files. Every one is different. – nikitablack Nov 10 '14 at 14:23
  • And if you have so many files, using streams, doesn't seem good. Use raw API of the OS. – Ajay Nov 10 '14 at 14:25
3

Since all of your data is variable, you can read the two blocks separately and still use casting:

struct id_contents
{
    uint16_t len;
    char id[];
} __attribute__((packed)); // assuming gcc, ymmv

struct data_contents
{
    uint32_t stride;
    float data[];
} __attribute__((packed)); // assuming gcc, ymmv

class my_row
{
    const id_contents* id_;
    const data_contents* data_;
    size_t size_;

public:
    my_row(const char* buffer) {
        id_= reinterpret_cast<const id_contents*>(buffer);
        size_ = sizeof(*id_) + id_->len;
        data_ = reinterpret_cast<const data_contents*>(buffer + size_);
        size_ += sizeof(*data_) + 
            data_->stride * sizeof(float); // or however many, 3*float?

    }

    size_t size() const { return size_; }
};

That way you can use Mr. kbok's answer to parse correctly:

const char* buffer = getPointerToDataSomehow();

my_row data1(buffer);
buffer += data1.size();

my_row data2(buffer);
buffer += data2.size();

// etc.
Barry
  • I didn't realize the float data was variable too, so this'll get that part – Barry Nov 10 '14 at 14:36
  • Note: Ending a struct with an array without a size is called a "flexible array member". More info at http://stackoverflow.com/questions/2060974/dynamic-array-in-struct-c – slaphappy Nov 10 '14 at 15:00
  • This code makes no effort to ensure the `short`, `int` and `float` data access via `id_` and `data_` will be properly aligned on 2 / 4 / 4 byte memory boundaries, and depending on the hardware may SIGBUS or similar, or suffer misaligned-data read performance penalties.... – Tony Delroy Nov 10 '14 at 16:09
3

I personally do it this way:

// some code which loads the file in memory
#pragma pack(push, 1)
struct someFile { int a, b, c; char d[0xEF]; };
#pragma pack(pop)

someFile* f = (someFile*) (file_in_memory);
int filePropertyA = f->a;

A very effective way for fixed-size structs at the start of the file.

rev
2

Use a serialization library. Here are a few:

Átila Neves
  • You should add a license to your library, otherwise nobody will really think about using it. – Lukas Mar 28 '15 at 10:02
2

The Kaitai Struct library provides a very effective declarative approach, which has the added bonus of working across programming languages.

After installing the compiler, you will want to create a .ksy file that describes the layout of your binary file. For your case, it would look something like this:

# my_type.ksy

meta:
  id: my_type
  endian: be # for big-endian, or "le" for little-endian

seq: # describes the actual sequence of data one-by-one
  - id: len
    type: u2 # unsigned short in C++, two bytes
  - id: my_string
    type: str
    size: 5
    encoding: UTF-8
  - id: stride
    type: u4 # unsigned int in C++, four bytes
  - id: float_data
    type: f4 # a four-byte floating point number
    repeat: expr
    repeat-expr: 6 # repeat six times

You can then compile the .ksy file using the kaitai struct compiler ksc:

# wherever the compiler is installed
# -t specifies the target language, in this case C++
/usr/local/bin/kaitai-struct-compiler my_type.ksy -t cpp_stl

This will create a my_type.cpp file as well as a my_type.h file, which you can then include in your C++ code:


#include <fstream>
#include <kaitai/kaitaistream.h>
#include "my_type.h"

int main()
{
  std::ifstream ifs("my_data.bin", std::ifstream::binary);
  kaitai::kstream ks(&ifs);
  my_type_t obj(&ks);

  std::cout << obj.len() << '\n'; // you can now access properties of the object

  return 0;
}

Hope this helped! You can find the full documentation for Kaitai Struct here. It has a load of other features and is a fantastic resource for binary parsing in general.

1

I use the ragel tool to generate pure procedural C source code (no tables) for microcontrollers with 1-2K of RAM. It does not use any file I/O or buffering, and it produces both easy-to-debug code and a .dot/.pdf file with a state machine diagram.

ragel can also output Go, Java, etc. code for parsing, but I did not use these features.

The key feature of ragel is the ability to parse any byte-oriented data, but you can't dig into bit fields. Another limitation is that ragel can parse regular structures, but has no recursion and no grammar-based parsing.

Dmitry Ponyatov