How can I unpack a length-prefixed bytestring?

Question

I used the standard library struct module to pack a bytes object into a string, preceded by length information:

>>> import struct
>>> string = b'-'
>>> t = struct.pack(">h%ds" % len(string), len(string), string)
>>> print(t)
b'\x00\x01-'

Of course I could just remove the length count to get back the original data. But how can I unpack this data, respecting the length count, in order to get back b'-'?

score 5 · Answer 1 · answered Mar 02 '11 at 00:11

Normally you wouldn't use struct.pack to put the length header and the value together. Instead, you would just do struct.pack(">h", len(data)), send that over the line (for example, in a network protocol), and then send the data. There is no need to create a new bytes buffer.

In your case, you could simply do:

dataLength, = struct.unpack(">h", t[:2])
data = t[2:2+dataLength]

But as I said, in a socket-based application, for instance, it would look like this (here receive stands for whatever function reads exactly that many bytes):

header = receive(2)
dataLength, = struct.unpack(">h", header)
data = receive(dataLength)
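
To make the receive pseudocode above concrete, here is a minimal sketch using the standard socket module. The helper names recv_exact, recv_prefixed and send_prefixed are made up for this illustration and are not part of any library. The loop is there because a single recv call may return fewer bytes than requested.

```python
import socket
import struct

def recv_exact(sock, count):
    """Read exactly `count` bytes from `sock`, or raise if the peer closes early."""
    chunks = []
    remaining = count
    while remaining > 0:
        chunk = sock.recv(remaining)
        if not chunk:
            raise ConnectionError("socket closed with %d bytes still expected" % remaining)
        chunks.append(chunk)
        remaining -= len(chunk)
    return b"".join(chunks)

def recv_prefixed(sock):
    """Read one message framed as a 2-byte big-endian length plus payload."""
    (data_length,) = struct.unpack(">h", recv_exact(sock, 2))
    return recv_exact(sock, data_length)

def send_prefixed(sock, payload):
    """Send the length header first, then the payload, without joining them."""
    sock.sendall(struct.pack(">h", len(payload)))
    sock.sendall(payload)

# Demonstration over a connected pair of local sockets.
left, right = socket.socketpair()
send_prefixed(left, b"-")
received = recv_prefixed(right)
left.close()
right.close()
print(received)
```

A real protocol would probably also want to reject negative or implausibly large lengths before reading the payload.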

unutbu · Answer 2 · 2011-03-02T00:54:35.287

import struct
string = b'-'
fmt=">h%ds" % len(string)

Here you are packing both the length and the string:

t = struct.pack(fmt, len(string), string)
print(repr(t))
# b'\x00\x01-'

So when you unpack, you should expect to get two values back, i.e., the length and the string:

length,string2=struct.unpack(fmt,t)
print(repr(string2))
# b'-'

In general, if you don't know how the string was packed, then there is no sure-fire way to recover the data. You'd just have to guess!

If you know the data is composed of the length of the string, and then the string itself, then you could try trial-and-error:

import struct
string = b'-'
fmt=">h%ds" % len(string)
t = struct.pack(fmt, len(string), string)
print(repr(t))

for endian in ('>','<'):
    for fmt,size in (('b',1),('B',1),('h',2),('H',2),('i',4),('I',4),
                     ('l',4),('L',4),('q',8),('Q',8)):
        fmt=endian+fmt
        try:
            length,=struct.unpack(fmt,t[:size])
        except struct.error:
            pass
        else:
            fmt=fmt+'{0}s'.format(length)
            try:
                length,string2=struct.unpack(fmt,t)
            except struct.error:
                pass
            else:
                print(fmt,length,string2)
# >h1s 1 b'-'
# >H1s 1 b'-'

It might be possible to compose an ambiguous string t which has multiple valid unpackings which would lead to different string2s, however. I'm not sure.

Okay. Now what if you don't know the actual length of string? — dbdii407, Mar 02 '11 at 00:26

score 1 · Answer 3 · answered Mar 02 '11 at 00:29

The struct module is designed for fixed-format blocks of data. However you can use the following code:

import struct
t=b'\x00\x01-'
(length,)=struct.unpack_from(">h", t)
(text,)=struct.unpack_from("%ds"%length, t, struct.calcsize(">h"))
print(text)

bcmpinc

`struct.calcsize(">h")` is a rather verbose way of writing `2` – John Machin Mar 02 '11 at 08:08

score 1 · Answer 4 · edited Jan 28 '23 at 12:49

Suppose data is a big chunk of bytes and you have successfully parsed out the first posn bytes. The documentation for this chunk of bytes says that the next item is a string of bytes preceded by a 16-bit signed (unlikely, but you did say h format) big-endian integer.

Here's what to do:

nbytes, = struct.unpack('>h', data[posn:posn+2])
posn += 2
the_string = data[posn:posn+nbytes]
posn += nbytes

and now you're positioned ready for the next item.

In Python 2.5 and up, you can use unpack_from() instead of slicing.
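
As a rough sketch of how this looks with unpack_from, the loop below walks a buffer that is assumed to contain nothing but such items back to back. That layout, and the name iter_prefixed, are assumptions made for the example.

```python
import struct

def iter_prefixed(data, posn=0):
    """Yield each length-prefixed byte string found in `data` from `posn` on."""
    while posn < len(data):
        # Read the 2-byte big-endian length at the current position.
        (nbytes,) = struct.unpack_from(">h", data, posn)
        posn += 2
        yield data[posn:posn + nbytes]
        # Move past the payload so we are positioned at the next item.
        posn += nbytes

chunk = b"\x00\x01-\x00\x03abc\x00\x00"
items = list(iter_prefixed(chunk))
print(items)  # [b'-', b'abc', b'']
```

unpack_from raises struct.error if fewer than two bytes remain at posn, which is a reasonable signal that the buffer is truncated.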

Karl Knechtel · Answer 5 · 2023-01-28T12:40:17.747

Generally, in "binary formats" like this, the purpose of prefixing a length to some data is so that the unpacking code knows how much data there is.

However, it is not possible to unpack the entire thing in one go with struct.unpack - because the struct module uses formats that are computed ahead of time. That's fine on the packing side, because all the data is available. It doesn't work on the unpacking side, because the information needs to be discovered on the fly.

In other words: when we unpack data like b'\x00\x01-', knowing that it was packed with an approach like in the OP example code, we cannot create a format string in advance that is correct for the data. To make that string, we need the length, but the length is in the data.
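
A small illustration of the problem, assuming two values packed the same way as in the OP: a format string built for one length does not fit data of another length.

```python
import struct

packed_short = struct.pack(">h1s", 1, b"-")    # b'\x00\x01-'
packed_long = struct.pack(">h3s", 3, b"abc")   # b'\x00\x03abc'

# The format that fits the first value works...
print(struct.unpack(">h1s", packed_short))     # (1, b'-')

# ...but it does not fit the second, because '>h1s' always describes 3 bytes.
mismatch = None
try:
    struct.unpack(">h1s", packed_long)
except struct.error as exc:
    mismatch = exc
    print("struct.error:", exc)
```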

`struct.unpack_from`

So, unavoidably, we will need to make two separate attempts to read data. Since we want to consider only part of the data, we use struct.unpack_from rather than struct.unpack. The simplest approach is as follows:

Unpack the first two bytes from the beginning, to find out the length.
Using that length, unpack however many bytes that is, starting from just after the length count.

As explained in the documentation:

struct.unpack_from(format, /, buffer, offset=0)

Unpack from buffer starting at position offset, according to the format string format. The result is a tuple even if it contains exactly one item. The buffer’s size in bytes, starting at position offset, must be at least the size required by the format, as reflected by calcsize().

Thus:

>>> length_format = '>h'
>>> length, = struct.unpack_from(length_format, t)
>>> data, = struct.unpack_from(f'{length}s', t, 2)
>>> data
b'-'

Note the trailing commas: these unpack the one-element tuples returned by struct.unpack_from (unpacking in the native Python sense, not the struct sense).

The 2 in the second call, of course, accounts for the amount of data unpacked the first time. For more general cases, or if this is seen as too magical, the amount of data unpacked can be computed by calling struct.calcsize on the format string.
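
One possible way to generalize this is a small helper that takes the length format as a parameter and uses struct.calcsize for the offset. The name unpack_prefixed and its return convention are illustrative choices, not anything provided by the struct module.

```python
import struct

def unpack_prefixed(buffer, length_format=">h", offset=0):
    """Return (payload, next_offset) for one length-prefixed item."""
    (length,) = struct.unpack_from(length_format, buffer, offset)
    offset += struct.calcsize(length_format)   # skip past the length field
    (payload,) = struct.unpack_from(f"{length}s", buffer, offset)
    return payload, offset + length

t = struct.pack(">h1s", 1, b"-")
print(unpack_prefixed(t))                      # (b'-', 3)

wide = struct.pack(">I5s", 5, b"hello")        # 4-byte unsigned length field
print(unpack_prefixed(wide, ">I"))             # (b'hello', 9)
```

Returning the next offset makes it easy to keep parsing whatever follows in the same buffer.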

Historical notes

This question was asked a long time ago, and modern tools may not have been available.

Before Python 3.6, it would be necessary to call .format on the string, rather than using an f-string, to create the second format string; thus, '{}s'.format(length). Before 2.6, it would be necessary to use the same %-style formatting as in OP: '%ds' % length.

Before 2.5, struct.unpack_from was not available. To work around this, explicitly slice the string appropriately, and then use unpack:

>>> length_format = '>h'
>>> length_size = struct.calcsize(length_format)
>>> length, = struct.unpack(length_format, t[:length_size])
>>> data, = struct.unpack('%ds' % length, t[length_size:length_size + length])
>>> data
b'-'

Other considerations: streaming data, and handling data one type at a time

Of course, all functionality in struct expects to work on a buffer. If data is coming in from a binary stream (such as a network connection, or a file opened in binary mode), it would have to be read fully before struct.unpack or struct.unpack_from could be used. This potentially wastes a lot of memory, and makes little sense considering that we need to consider the data in two separate steps anyway.

Let's model an input stream:

>>> import io
>>> stream = io.BytesIO(t)

Since each read picks up where the previous one left off, we don't need to track an offset. Instead, we just read the appropriate number of bytes each time. Using the struct module:

>>> length_format = '>h'
>>> length_size = struct.calcsize(length_format)
>>> length, = struct.unpack(length_format, stream.read(length_size))
>>> data, = struct.unpack(f'{length}s', stream.read(length))
>>> data
b'-'

But now it should be fairly obvious that the struct module is overkill for the task of interpreting the data. The first read is just a couple of bytes representing an integer; the int type already knows how to interpret that. As for the second read, stream.read(length) is already the desired data, so there is no reason to do any more processing. Thus:

>>> stream = io.BytesIO(t)  # fresh stream; the previous one was already consumed
>>> length = int.from_bytes(stream.read(length_size), 'big')
>>> data = stream.read(length)
>>> data
b'-'
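
For completeness, here is a sketch of the same idea as a round trip without struct at all. The function names and the LENGTH_SIZE constant are invented for this example. Passing signed=True is an assumption made to mirror the signed 'h' format from the question; int.from_bytes treats the bytes as unsigned by default.

```python
import io

LENGTH_SIZE = 2  # width of the length field in bytes, matching '>h'

def write_prefixed(stream, payload):
    """Write a big-endian length field followed by the payload."""
    stream.write(len(payload).to_bytes(LENGTH_SIZE, "big", signed=True))
    stream.write(payload)

def read_prefixed(stream):
    """Read one length-prefixed payload, checking for truncated input."""
    header = stream.read(LENGTH_SIZE)
    if len(header) < LENGTH_SIZE:
        raise EOFError("stream ended inside the length field")
    # signed=True mirrors the 'h' (signed short) format used when packing.
    length = int.from_bytes(header, "big", signed=True)
    if length < 0:
        raise ValueError("negative length in header")
    payload = stream.read(length)
    if len(payload) < length:
        raise EOFError("stream ended inside the payload")
    return payload

stream = io.BytesIO()
write_prefixed(stream, b"-")
write_prefixed(stream, b"abc")
stream.seek(0)
first = read_prefixed(stream)
second = read_prefixed(stream)
print(first, second)  # b'-' b'abc'
```

io.BytesIO always returns the full amount requested when it is available; other kinds of stream may return less from a single read, in which case the reads would need to be repeated until enough bytes have arrived.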

Santa · Answer 6 · 2011-03-02T00:18:28.377

How exactly are you unpacking?

>>> string = b'-'
>>> format = '>h%ds' % len(string)
>>> format
'>h1s'
>>> struct.calcsize(format)
3

For unpack(fmt, string), len(string) must equal struct.calcsize(fmt). So the result of unpacking with this format cannot be just b'-'; it also includes the length.

But:

>>> t = b'\x00\x01-'
>>> length, data = struct.unpack(format, t)
>>> length, data
(1, b'-')

Now you can use data.

edited Mar 02 '11 at 00:18

answered Mar 02 '11 at 00:12
