I have binary data that is stored in a non-trivial format where the information 'chunks' are not a fixed size and are similar to packets. I am reading them dynamically using this function:
def unpack_bytes(stream: BytesIO, binary_format: str) -> tuple:
    size = struct.calcsize(binary_format)
    buf = stream.read(size)
    print(buf)
    return struct.unpack(binary_format, buf)
This function is called with the appropriate format as needed and the code that creates the stream and loops over it is as follows:
def parse_data_file(data_directory: str) -> Generator[CompressedFile]:
    with open(data_directory, 'rb') as packet_stream:
        while <EOF file logic here>:
            contents = parse_packet(packet_stream)
            contents = gzip.compress(data=contents, compresslevel=9)
            yield CompressedFile(filename=f"{uuid.uuid4()}.gz", datetime=datetime.now(),
                                 contents=contents)
`CompressedFile` is just a small dataclass to store the filename, datetime, and contents. `parse_packet` extracts a single packet (as per the data spec) from the binary file and returns its contents. Since the packets don't have a fixed width, I am wondering what the best way to stop the loop would be. The two options I know of are:
- Add some extra logic to `unpack_bytes()` to bubble up an EOF.
- Do some cursor-foo to save the EOF and check against it as it loops.

I'd like to not manipulate the cursor directly if possible.
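
For reference, the first option might look roughly like the sketch below. It is a minimal illustration under assumptions, not my real code: the `EOFError`/`ValueError` split and the `read_all` caller are hypothetical, and the `"<HH"` format is only demo data.

```python
import struct
from io import BytesIO
from typing import BinaryIO


def unpack_bytes(stream: BinaryIO, binary_format: str) -> tuple:
    """Read one struct from the stream; raise EOFError when no bytes remain."""
    size = struct.calcsize(binary_format)
    buf = stream.read(size)
    if not buf:
        # Nothing left to read: signal a clean end of stream to the caller.
        raise EOFError("end of stream")
    if len(buf) < size:
        # The stream ended part-way through a field, so the data is truncated.
        raise ValueError(f"expected {size} bytes, got {len(buf)}")
    return struct.unpack(binary_format, buf)


def read_all(stream: BinaryIO, binary_format: str) -> list:
    """Hypothetical caller: collect records until EOFError bubbles up."""
    records = []
    while True:
        try:
            records.append(unpack_bytes(stream, binary_format))
        except EOFError:
            break
    return records


# Demo data: two records of two little-endian unsigned shorts each.
demo = BytesIO(struct.pack("<HH", 1, 2) + struct.pack("<HH", 3, 4))
print(read_all(demo, "<HH"))  # [(1, 2), (3, 4)]
```

The drawback is the one I'm trying to avoid: every layer between `unpack_bytes` and the loop has to let the exception through, and an `EOFError` raised in the middle of a packet looks the same as one raised at a packet boundary.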
Is there a more idiomatic way to check for EOF within `parse_data_file`?
The last call to `parse_packet` (and by extension the last call to `unpack_bytes`) will consume all the data, and the cursor will be at the end when the next iteration of the loop begins. I'd like to take advantage of that state instead of adding EOF handling code all the way up from `unpack_bytes` or fiddling with the cursor directly.
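
To make that concrete, the kind of check I have in mind might look like the sketch below. It assumes the stream is an `io.BufferedReader` (which is what `open(path, 'rb')` returns for a regular file) and uses `peek()`, which returns an empty bytes object at end of stream without advancing the position. The names `at_eof` and `iter_packets` are hypothetical, and the length-prefixed `parse_packet` is only a stand-in for my real packet format.

```python
import io
import struct
from typing import Iterator


def at_eof(stream: io.BufferedReader) -> bool:
    """Return True when no bytes remain, without advancing the read position."""
    # peek() may return more or fewer bytes than requested; for a regular
    # file it returns b"" only once the end of the stream has been reached.
    return not stream.peek(1)


def parse_packet(stream: io.BufferedReader) -> bytes:
    """Hypothetical stand-in: 2-byte little-endian length, then the payload."""
    (length,) = struct.unpack("<H", stream.read(2))
    return stream.read(length)


def iter_packets(stream: io.BufferedReader) -> Iterator[bytes]:
    """Yield packets until the stream is exhausted at a packet boundary."""
    while not at_eof(stream):
        yield parse_packet(stream)


# An in-memory buffer wrapped in BufferedReader stands in for open(path, 'rb').
raw = io.BytesIO(b"\x03\x00abc" + b"\x01\x00z")
packets = list(iter_packets(io.BufferedReader(raw)))
print(packets)  # [b'abc', b'z']
```

This keeps the EOF check in the loop condition and leaves `unpack_bytes` untouched, but it only detects EOF at a packet boundary; a file truncated mid-packet would still surface as a `struct.error` from inside `parse_packet`.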