I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunking positions depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.
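To make the setup concrete, here is a minimal sketch of the linear, byte-at-a-time chunking described above. The names `iter_chunks` and `is_boundary` are hypothetical, and the "cut after every zero byte" rule is only a stand-in for the real content-dependent rule. This variant accumulates the current chunk in a `bytearray`.

```python
import io
from typing import BinaryIO, Callable, Iterator


def iter_chunks(stream: BinaryIO, is_boundary: Callable[[int], bool]) -> Iterator[bytes]:
    """Read `stream` linearly and yield a chunk after each boundary byte."""
    current = bytearray()
    # read(1) returns b"" at end of file, which ends the loop.
    while byte := stream.read(1):
        current.append(byte[0])
        if is_boundary(byte[0]):
            yield bytes(current)
            current.clear()
    if current:
        # Emit whatever is left after the last boundary.
        yield bytes(current)


# Placeholder rule (an assumption for illustration): cut after every zero byte.
chunks = list(iter_chunks(io.BytesIO(b"ab\x00cd\x00e"), lambda b: b == 0))
```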
It was very convenient to use a memory-mapped file wrapped by a `bytearray`. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.
It was also faster than accumulating the current chunk in a `bytearray` (and much faster than accumulating in `bytes`!). But I have certain concerns that I would like to address:
- Is `bytearray` copying the data?
- I open the file as `rb` and the `mmap` with `access=mmap.ACCESS_READ`. But `bytearray` is, in principle, a mutable container. Is this a performance problem? Is there a read-only container that I should use?
- Because I do not accumulate in the buffer, I am randomly accessing the `bytearray` (and therefore the underlying file). Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
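
For reference, a minimal sketch of the offset-and-size approach, useful for checking the first two concerns. The helper `iter_spans`, the zero-byte boundary rule, and the temporary demo file are assumptions made for illustration. As far as I understand, `bytearray(mm)` builds a separate in-memory copy of the mapped bytes, while `memoryview(mm)` exposes the mapping without copying and reports itself as read-only when the map is opened with `mmap.ACCESS_READ`.

```python
import mmap
import os
import tempfile
from typing import Callable, Iterator, Tuple


def iter_spans(buf, is_boundary: Callable[[int], bool]) -> Iterator[Tuple[int, int]]:
    """Yield (offset, size) for each chunk; the caller does the slicing."""
    start = 0
    for pos in range(len(buf)):
        if is_boundary(buf[pos]):
            yield start, pos + 1 - start
            start = pos + 1
    if start < len(buf):
        # Trailing bytes after the last boundary form the final chunk.
        yield start, len(buf) - start


# Small temporary file standing in for the real input (an assumption).
with tempfile.NamedTemporaryFile(delete=False) as tmp:
    tmp.write(b"ab\x00cd\x00e")
    path = tmp.name

with open(path, "rb") as f, mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
    copied = bytearray(mm)  # independent copy of the mapped bytes
    # The view must be released before the mmap is closed, hence the nested "with".
    with memoryview(mm) as view:
        read_only = view.readonly
        spans = list(iter_spans(view, lambda b: b == 0))
        # Slicing the view copies nothing; bytes() copies only that one chunk.
        pieces = [bytes(view[off:off + size]) for off, size in spans]

os.remove(path)
```

With a `memoryview`, only the chunks that are explicitly converted with `bytes()` are copied, and the operating system decides which pages of the file stay resident, so the whole file does not need to fit in the process's own buffers.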