Generally, in "binary formats" like this, the purpose of prefixing a length to some data is so that the unpacking code knows how much data there is.
However, it is not possible to unpack the entire thing in one go with struct.unpack, because the struct module uses formats that are computed ahead of time. That's fine on the packing side, because all the data is available. It doesn't work on the unpacking side, because the information needs to be discovered on the fly.
In other words: when we unpack data like b'\x00\x01-', knowing that it was packed with an approach like in the OP example code, we cannot create a format string in advance that is correct for the data. To make that string, we need the length, but the length is in the data.
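For concreteness, the packing side might look roughly like the sketch below. The helper name pack_prefixed is hypothetical, and the '>h' length prefix is assumed from the examples in this answer rather than copied from the OP's code. The point is only that the packer knows the length up front, so it can build the whole format string before calling struct.pack.

```python
import struct

def pack_prefixed(payload: bytes) -> bytes:
    # Hypothetical helper: big-endian signed 16-bit length, then the raw bytes.
    # The format can be built in advance because len(payload) is known here.
    return struct.pack(f'>h{len(payload)}s', len(payload), payload)

t = pack_prefixed(b'-')
print(t)  # b'\x00\x01-'
```

The unpacking side has no equivalent of len(payload) until it has already read the first two bytes.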
struct.unpack_from
So, unavoidably, we will need to make two separate attempts to read data. Since we want to consider only part of the data each time, we use struct.unpack_from rather than struct.unpack. The simplest approach is as follows:
- Unpack the first two bytes from the beginning, to find out the length.
- Using that length, unpack however many bytes that is, starting from just after the length count.
As explained in the documentation:
struct.unpack_from(format, /, buffer, offset=0)
Unpack from buffer starting at position offset, according to the format string format. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes, starting at position offset, must be at least the size required by the format, as reflected by calcsize().
Thus:
>>> import struct
>>> t = b'\x00\x01-'  # the example data from above
>>> length_format = '>h'
>>> length, = struct.unpack_from(length_format, t)
>>> data, = struct.unpack_from(f'{length}s', t, 2)
>>> data
b'-'
Note the trailing commas: these are used to unpack the tuples returned by struct.unpack_from (not in the struct unpacking sense, but in the native Python sense).
The 2 in the second call, of course, accounts for the amount of data unpacked the first time. For more general cases, or if this is seen as too magical, the amount of data unpacked can be computed by calling struct.calcsize on the format string.
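As a minimal sketch of that idea, the two steps can be wrapped in a function that computes the offset with struct.calcsize instead of hard-coding 2. The names unpack_prefixed and LENGTH_FORMAT are made up for illustration, and the '>h' prefix is assumed from the examples above.

```python
import struct

LENGTH_FORMAT = '>h'  # assumed: big-endian signed 16-bit length prefix

def unpack_prefixed(buffer, offset=0):
    """Return (payload, next_offset) for one length-prefixed record."""
    length, = struct.unpack_from(LENGTH_FORMAT, buffer, offset)
    # Compute where the payload starts rather than hard-coding 2.
    offset += struct.calcsize(LENGTH_FORMAT)
    payload, = struct.unpack_from(f'{length}s', buffer, offset)
    return payload, offset + length

payload, next_offset = unpack_prefixed(b'\x00\x01-')
print(payload, next_offset)  # b'-' 3
```

Returning the next offset makes it possible to call the function repeatedly on a buffer that holds several records back to back.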
Historical notes
This question was asked a long time ago, and modern tools may not have been available.
Before Python 3.6, it would be necessary to call .format on the string, rather than using an f-string, to create the second format string; thus, '{}s'.format(length) (or '{0}s'.format(length) on 2.6 and 3.0, which lack automatic field numbering). Before 2.6, it would be necessary to use the same %-style formatting as in OP: '%ds' % length.
Before 2.5, struct.unpack_from was not available. To work around this, explicitly slice the string appropriately, and then use unpack:
>>> length_format = '>h'
>>> length_size = struct.calcsize(length_format)
>>> length, = struct.unpack(length_format, t[:length_size])
>>> data, = struct.unpack('%ds' % length, t[length_size:length_size + length])
>>> data
b'-'
Other considerations: streaming data, and handling data one type at a time
Of course, all functionality in struct expects to work on a buffer. If data is coming in from a binary stream (such as a network connection, or a file opened in binary mode), it would have to be read fully before struct.unpack or struct.unpack_from could be used. This potentially wastes a lot of memory, and makes little sense considering that we need to consider the data in two separate steps anyway.
Let's model an input stream:
>>> import io
>>> stream = io.BytesIO(t)
Since each read advances the stream position, the reads happen sequentially and we don't need to track an offset. Instead, we just read the appropriate number of bytes each time. Using the struct module:
>>> length_format = '>h'
>>> length_size = struct.calcsize(length_format)
>>> length, = struct.unpack(length_format, stream.read(length_size))
>>> data, = struct.unpack(f'{length}s', stream.read(length))
>>> data
b'-'
But now it should be fairly obvious that the struct module is overkill for the task of interpreting the data. The first read is just a couple of bytes representing an integer; the int type already knows how to interpret that. As for the second read, stream.read(length) is already the desired data, so there is no reason to do any more processing. Thus:
>>> stream = io.BytesIO(t)  # start over; the previous example consumed the stream
>>> length = int.from_bytes(stream.read(length_size), 'big')
>>> data = stream.read(length)
>>> data
b'-'
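For real streams, a single read call may return fewer bytes than requested, and there are usually several records in a row. The following is only a sketch under those assumptions: read_exact and iter_records are hypothetical helper names, the stream is assumed to be blocking, and the length prefix is assumed to be a 2-byte big-endian integer. Note that int.from_bytes treats the prefix as unsigned by default, whereas '>h' is signed; for valid (non-negative) lengths the results agree.

```python
import io

def read_exact(stream, n):
    """Read exactly n bytes, or raise EOFError (hypothetical helper)."""
    chunks = []
    remaining = n
    while remaining > 0:
        chunk = stream.read(remaining)
        if not chunk:
            raise EOFError(f'expected {n} bytes, stream ended {remaining} short')
        chunks.append(chunk)
        remaining -= len(chunk)
    return b''.join(chunks)

def iter_records(stream, length_size=2):
    """Yield payloads until the stream ends cleanly between records."""
    while True:
        first = stream.read(1)
        if not first:
            return  # clean end of stream: no partial record pending
        header = first + read_exact(stream, length_size - 1)
        length = int.from_bytes(header, 'big')
        yield read_exact(stream, length)

stream = io.BytesIO(b'\x00\x01-\x00\x02hi')
print(list(iter_records(stream)))  # [b'-', b'hi']
```

A stream that ends in the middle of a record raises EOFError instead of silently returning a short payload.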