
So, I'm getting this data. From the network socket, or out of a file. I'm cobbling together code that will interpret the data. Read some bytes, check some flags, and some bytes indicate how much data follows. Read in that much data, rinse, repeat.

This task reminds me a lot of parsing source code. I'm comfortable with lex/yacc and ANTLR, but they're not up to this task. You can't specify bits and raw bytes as tokens (well, maybe you could, but I wouldn't know how), and you can't coax them into "read two bytes, make them into an unsigned 16-bit integer, call it n, and then read n bytes".
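For concreteness, the hand-rolled "read a length, then read that many bytes" step could look roughly like the Java sketch below. The class name `LengthPrefixedReader`, the big-endian byte order, and the sample bytes are assumptions made for illustration, not part of any particular protocol.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.nio.charset.StandardCharsets;

// Hypothetical helper for illustration; not taken from any library.
final class LengthPrefixedReader {
    /** Reads one record: an unsigned 16-bit big-endian length n, then n payload bytes. */
    static byte[] readRecord(DataInputStream in) throws IOException {
        int n = in.readUnsignedShort();   // two bytes -> unsigned 16-bit integer
        byte[] payload = new byte[n];
        in.readFully(payload);            // reads exactly n bytes or throws EOFException
        return payload;
    }

    public static void main(String[] args) throws IOException {
        // Two records: length 3 ("abc"), then length 1 ("z").
        byte[] wire = {0x00, 0x03, 'a', 'b', 'c', 0x00, 0x01, 'z'};
        DataInputStream in = new DataInputStream(new ByteArrayInputStream(wire));
        System.out.println(new String(readRecord(in), StandardCharsets.US_ASCII)); // abc
        System.out.println(new String(readRecord(in), StandardCharsets.US_ASCII)); // z
    }
}
```

The same `DataInputStream` could wrap a socket's input stream or a `FileInputStream`; the tedious part is repeating this pattern by hand for every field of the format.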

Then again, when the spec of the protocol/data format is defined in a systematic manner (not all of them are), there should be a systematic way to read in data that is formatted according to the protocol. Right?

There's gotta be a tool that does that.

doppelfish
  • voting to reopen because this is less about seeking recommendations and more about finding out whether or not a certain kind of technology exists. i found this page very relevant and helpful, and i hate to think it is hidden from those not having the close/reopen privilege – pestophagous Feb 25 '21 at 23:44

8 Answers


The Kaitai Struct project emerged recently to solve exactly that task: generating binary parsers from a spec. You describe the serialization of an arbitrary data structure in a YAML/JSON-based format like this:

meta:
  id: my_struct
  endian: le
seq:
  - id: some_int
    type: u4
  - id: some_string
    type: str
    encoding: UTF-8
    size: some_int + 4
  - id: another_int
    type: u4

Compile it with ksc (the reference compiler the project provides), and voilà, you have a parser in any of the supported programming languages, for example in C++:

my_struct_t::my_struct_t(kaitai::kstream *p_io, kaitai::kstruct *p_parent, my_struct_t *p_root) : kaitai::kstruct(p_io) {
    m__parent = p_parent;
    m__root = this;
    m_some_int = m__io->read_u4le();
    m_some_string = m__io->read_str_byte_limit((some_int() + 4), "UTF-8");
    m_another_int = m__io->read_u4le();
}

or in Java:

private void _parse() throws IOException {
    this.someInt = this._io.readU4le();
    this.someString = this._io.readStrByteLimit((someInt() + 4), "UTF-8");
    this.anotherInt = this._io.readU4le();
}

Once the generated class is added to your project, it offers a straightforward API like this (shown in Java, but more languages are supported):

// given file.dat contains 00 00 00 00|41 42 43 44|07 01 00 00
// (some_int is 0, so some_string spans some_int + 4 = 4 bytes)

MyStruct s = MyStruct.fromFile("path/to/file.dat");
s.someString() // => "ABCD"
s.anotherInt() // => 263 = 0x107

It supports different endianness, conditional structures, substructures, and a lot more. Fairly complex formats, such as PNG image files or PE executables, can be parsed.
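To make clear what the spec above describes, here is a hand-written Java sketch that parses the same layout using only `java.nio`. It does not use the Kaitai runtime; the class name `MyStructByHand` and the sample bytes are assumptions chosen for illustration.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

// Hand-written equivalent of the my_struct layout; hypothetical, not generated by ksc.
final class MyStructByHand {
    final long someInt;       // u4 held in a long, because Java ints are signed
    final String someString;
    final long anotherInt;

    MyStructByHand(byte[] data) {
        ByteBuffer buf = ByteBuffer.wrap(data).order(ByteOrder.LITTLE_ENDIAN); // endian: le
        someInt = Integer.toUnsignedLong(buf.getInt());          // type: u4
        byte[] raw = new byte[Math.toIntExact(someInt + 4)];     // size: some_int + 4
        buf.get(raw);                                            // throws BufferUnderflowException if too short
        someString = new String(raw, StandardCharsets.UTF_8);    // encoding: UTF-8
        anotherInt = Integer.toUnsignedLong(buf.getInt());       // type: u4
    }

    public static void main(String[] args) {
        byte[] data = {0, 0, 0, 0, 0x41, 0x42, 0x43, 0x44, 0x07, 0x01, 0, 0};
        MyStructByHand s = new MyStructByHand(data);
        System.out.println(s.someString); // ABCD
        System.out.println(s.anotherInt); // 263
    }
}
```

Every line of the constructor corresponds to one entry in `seq`, which is the bookkeeping a generator takes off your hands.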

dpm_min

You could try Boost.Spirit (v2), which has recently gained binary parsing support, including endianness-aware parsers for native, little-endian, and big-endian data:

// Not a complete or useful program; it only illustrates that raw binary
// data can be parsed into typed values.
#include <boost/cstdint.hpp>
#include <boost/spirit/include/qi.hpp>

typedef boost::uint8_t byte_t;
byte_t raw[16] = { 0 };
char const* hex = "01010000005839B4C876BEF33F83C0CA";
my_custom_hex_to_bytes(hex, raw, 16);  // user-supplied helper: decodes 32 hex digits into 16 bytes

// Parse the first four bytes of the raw stream as one 32-bit word (native byte order).
boost::uint32_t word(0);
byte_t* beg = raw;
bool ok = boost::spirit::qi::parse(beg, beg + 16, boost::spirit::qi::dword, word);

UPDATE: I found a similar question in which Joel de Guzman confirms in his answer that binary parsers are available: Can Boost Spirit be used to parse byte stream data?

mloskot

The Construct parser, written in Python, has done some interesting work in this field.

The project has had a number of authors, and periods of stagnation, but as of 2017 it seems to be more active again.

Craig McQueen
  • Looks like someone picked up maintenance recently. They have [new docs](http://construct.readthedocs.org/en/latest/) up too. – dowski Nov 27 '12 at 16:32

https://github.com/lonelybug/ddprotocol/wiki/Getting-Start

hope this is helpful to you.

noname

Read up on ASN.1. If you can describe the binary data in its terms, you can then use various available kits. Not for the faint of heart.
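As a rough idea of what the ASN.1 toolkits decode for you: the BER/DER encodings represent each element as a tag, a length, and a value. The Java sketch below reads one such triple. It is deliberately limited (single-byte tags only, lengths of at most three length bytes), and the class name `DerTlv` is an assumption for illustration, not an API of any ASN.1 kit.

```java
import java.nio.ByteBuffer;
import java.util.Arrays;

// Hypothetical minimal reader for one tag-length-value triple; not a full ASN.1 decoder.
final class DerTlv {
    final int tag;
    final byte[] value;

    DerTlv(int tag, byte[] value) {
        this.tag = tag;
        this.value = value;
    }

    static DerTlv read(ByteBuffer in) {
        int tag = in.get() & 0xFF;            // single-byte tags only
        int first = in.get() & 0xFF;
        int length;
        if (first < 0x80) {
            length = first;                   // short form: this byte is the length
        } else {
            int count = first & 0x7F;         // long form: number of length bytes that follow
            if (count == 0 || count > 3) {
                throw new IllegalArgumentException("unsupported length form");
            }
            length = 0;
            for (int i = 0; i < count; i++) {
                length = (length << 8) | (in.get() & 0xFF);  // big-endian length
            }
        }
        byte[] value = new byte[length];
        in.get(value);
        return new DerTlv(tag, value);
    }

    public static void main(String[] args) {
        // 02 01 05 is the DER encoding of INTEGER 5.
        DerTlv tlv = read(ByteBuffer.wrap(new byte[]{0x02, 0x01, 0x05}));
        System.out.println(tlv.tag + " " + Arrays.toString(tlv.value)); // 2 [5]
    }
}
```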

bmargulies

There is certainly nothing stopping you from writing a recursive descent parser for binary data, the same way you would hand-tool a text parser. If the format you need to read is not too complicated, this is a reasonable way to proceed.
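A minimal sketch of that idea in Java, assuming a made-up nested format (the grammar in the comment, the class name `NestedParser`, and the sample bytes are all invented for illustration): one method per grammar rule, with the list rule calling back into the node rule.

```java
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.ArrayList;
import java.util.List;

// Recursive descent over a hypothetical binary format:
//   node := 0x01 len:u1 bytes[len]      (leaf, decoded as ASCII text)
//         | 0x02 count:u1 node{count}   (list of child nodes)
final class NestedParser {
    static Object parseNode(ByteBuffer in) {
        int kind = in.get() & 0xFF;       // the leading byte selects the rule
        switch (kind) {
            case 0x01: return parseLeaf(in);
            case 0x02: return parseList(in);
            default: throw new IllegalArgumentException("unknown node kind " + kind);
        }
    }

    static String parseLeaf(ByteBuffer in) {
        byte[] raw = new byte[in.get() & 0xFF];
        in.get(raw);
        return new String(raw, StandardCharsets.US_ASCII);
    }

    static List<Object> parseList(ByteBuffer in) {
        int count = in.get() & 0xFF;
        List<Object> children = new ArrayList<>();
        for (int i = 0; i < count; i++) {
            children.add(parseNode(in));  // recursion back into the node rule
        }
        return children;
    }

    public static void main(String[] args) {
        byte[] data = {0x02, 0x02, 0x01, 0x02, 'h', 'i', 0x02, 0x01, 0x01, 0x01, '!'};
        System.out.println(parseNode(ByteBuffer.wrap(data))); // [hi, [!]]
    }
}
```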

Of course, if your format is very simple, you could take a look at Reading binary file defined by a struct and similar questions.

I don't know of any parser generators for non-text input, though those are also possible.


In the event that you are not familiar with coding parsers by hand, the canonical SO question is Learning to write a compiler. The Crenshaw tutorial (and in PDF) is a fast read.

dmckee --- ex-moderator kitten
  • Well, I *am* essentially writing a parser that way. By hand, that is. But am I the first person on earth to do this? – doppelfish Feb 06 '10 at 22:15

There is a tool called binpac that does exactly this.

http://www.icsi.berkeley.edu/pubs/networking/binpacIMC06.pdf

Albin Stigo

See also Google Protocol Buffers.

fche
  • This is really a comment, not an answer to the question. Please use "add comment" to leave feedback for the author. – Thor Aug 15 '12 at 22:28