Question up front:
Is there a pythonic way in the standard library to parse raw binary files using for ... in ... syntax (i.e., __iter__/__next__) that yields blocks that respect the buffer size (the buffering argument to open), without having to subclass IOBase or its child classes?
Detailed explanation
I'd like to open a raw file for parsing using the for ... in ... syntax, and I'd like that syntax to yield predictably sized objects. This wasn't happening as expected for a problem I was working on, so I tried the following test (import numpy as np required):
In [271]: with open('tinytest.dat', 'wb') as f:
     ...:     f.write(np.random.randint(0, 256, 16384, dtype=np.uint8).tobytes())
     ...:
In [272]: np.array([len(b) for b in open('tinytest.dat', 'rb', 16)])
Out[272]:
array([ 13, 138, 196, 263, 719, 98, 476, 3, 266, 63, 51,
241, 472, 75, 120, 137, 14, 342, 148, 399, 366, 360,
41, 9, 141, 282, 7, 159, 341, 355, 470, 427, 214,
42, 1095, 84, 284, 366, 117, 187, 188, 54, 611, 246,
743, 194, 11, 38, 196, 1368, 4, 21, 442, 169, 22,
207, 226, 227, 193, 677, 174, 110, 273, 52, 357])
I could not understand why these seemingly random block lengths were arising, or why iteration was not respecting the buffer size argument. Using read1 gave the expected number of bytes:
In [273]: with open('tinytest.dat', 'rb', 16) as f:
     ...:     b = f.read1()
     ...:     print(len(b))
     ...:     print(b)
     ...:
16
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n\x0f8}'
And there it is: A newline near the end of the first block.
In [274]: with open('tinytest.dat', 'rb', 2048) as f:
     ...:     print(f.readline())
     ...:
b'M\xfb\xea\xc0X\xd4U%3\xad\xc9u\n'
Sure enough, readline was being called to produce each block of the file, and it was splitting on every byte with the newline value (10, i.e. b'\n'). I verified this by reading the source; these lines are from the definition of IOBase:
    def __next__(self):
        line = self.readline()
        if not line:
            raise StopIteration
        return line
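The effect can be reproduced without a file on disk. Below is a minimal sketch, assuming a hypothetical 8-byte payload (`data`) with newline bytes at known positions; the in-memory streams stand in for the real file. The chunk boundaries follow the newline bytes, and the buffer size given to the buffered reader makes no difference.

```python
import io

# Hypothetical payload: newline bytes (value 10) at indices 2 and 5.
data = bytes([1, 2, 10, 3, 4, 10, 5, 6])

# Iterating a binary stream calls readline(), so each chunk ends after a b'\n'.
lengths = [len(chunk) for chunk in io.BytesIO(data)]

# Wrapping the stream in a BufferedReader with a 4-byte buffer changes nothing:
# the buffer size controls how data is fetched, not how iteration splits it.
buffered = io.BufferedReader(io.BytesIO(data), buffer_size=4)
buffered_lengths = [len(chunk) for chunk in buffered]

print(lengths)           # [3, 3, 2]
print(buffered_lengths)  # [3, 3, 2]
```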
So my question is this: is there some more pythonic way to get raw file iteration that respects the buffer size and allows for ... in ... syntax, without having to subclass IOBase or its child classes (and thus leaving the standard library)? If not, does this unexpected behavior warrant a PEP? (Or does it just warrant learning to expect the behavior? :)
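For comparison, the closest standard-library candidate I'm aware of is the two-argument form of iter, which calls a function repeatedly until it returns a sentinel. The sketch below is only that, a sketch: `iter_blocks` is a hypothetical helper name, the in-memory stream stands in for the real file, and the block size is passed explicitly to read rather than taken from the buffering argument of open, so it does not quite give the plain `for b in f` behavior asked about.

```python
import functools
import io


def iter_blocks(stream, blocksize):
    """Hypothetical helper: iterate over reads of up to `blocksize` bytes.

    iter(callable, sentinel) keeps calling stream.read(blocksize) and stops
    when it returns b"" (end of file).
    """
    return iter(functools.partial(stream.read, blocksize), b"")


# 1024 bytes that include newline bytes, to show they are not treated specially.
data = bytes(range(256)) * 4
stream = io.BytesIO(data)

block_lengths = [len(block) for block in iter_blocks(stream, 100)]
print(block_lengths)  # ten blocks of 100 bytes, then a final block of 24
```

With a real file the same call would look like `for block in iter_blocks(f, 16): ...` on a file opened in 'rb' mode; only the last block can be shorter than the requested size.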