Slicing a file in Python

Question

I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read it one byte at a time. I do not need random access; I just read it linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory mapped file wrapped by a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.
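
A minimal sketch of the offset-and-size approach described above, assuming Python 3, where indexing `bytes`, `bytearray`, `mmap`, or `memoryview` objects gives an int. The names `chunk_spans` and `is_boundary` are hypothetical, and the newline rule in the usage is only a stand-in for the real content-dependent rule.

```python
def chunk_spans(buf, is_boundary):
    """Yield (offset, size) pairs; a chunk ends right after a boundary byte."""
    start = 0
    for pos in range(len(buf)):
        # buf[pos] is an int on Python 3 for bytes-like objects
        if is_boundary(buf[pos]):
            yield start, pos + 1 - start
            start = pos + 1
    # Emit whatever is left after the last boundary, if anything
    if start < len(buf):
        yield start, len(buf) - start


# Usage: the outer code does the slicing, as in the question
data = b"ab\ncd\n\nef"
pieces = [data[off:off + size]
          for off, size in chunk_spans(data, lambda b: b == 0x0A)]
```

The generator never builds a chunk itself, so whether a slice copies depends only on the type of `buf` that the caller slices.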

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

Is bytearray copying the data?
I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read only container that I should use?
Because I do not accumulate in the buffer, I am randomly accessing the bytearray (and therefore the underlying file). Even though it might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?

Are you able to read the sources? https://www.python.org/downloads/source/ it is in the Objects folder. — User, Nov 02 '14 at 06:26
@User Thanks for the tip. So bytearray is copying the data. I am using bytearray to avoid calling `ord` in each step of the loop. In a way, what I would need is something like `numpy.memmap(, dtype='uint8', mode='r')`, which allows me to iterate through the bytes (in their integer representation) — Hernan, Nov 02 '14 at 15:08
What Python version? Also, how are you wrapping `mmap` in your `bytearray`? — Veedrac, Nov 02 '14 at 16:46
@Veedrac I am targeting 2.7 and 3.4. Right now, I am just doing bytearray(mmap()) — Hernan, Nov 02 '14 at 22:46
Ah, I'm pretty sure that the call copies the contents of the `mmap`. `bytearray` can't "wrap" `mmap` like that. If you want to support both 2.7 and 3.4 you'll probably be better off using Numpy (eg. [`numpy.memmap`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html)). This supports copy-free slicing. I've not used it but I would note the "An alternative to using this" section. — Veedrac, Nov 03 '14 at 01:43
@Veedrac I have used memmap and it works nicely, but I would like something that does not require numpy. Will keep looking, thanks a lot! — Hernan, Nov 03 '14 at 02:43
On Python 3 you can use `memoryview` but `mmap` doesn't support the `memoryview` protocol on 2.x. — Veedrac, Nov 03 '14 at 02:48
For the first question, I assume bytearray doesn't copy the data. A bytearray contains integers from 0 to 255. The current implementation **keeps an array of integer objects for all integers between -5 and 256**, when you create an int **in that range you actually just get back a reference to the existing object**. [Stackoverflow question Reference](http://stackoverflow.com/questions/3402679/identifying-objects-why-does-the-returned-value-from-id-change) — Tal, Nov 10 '14 at 08:47
Although your explanation is pretty detailed, a piece of code would be very helpful. — Alex, Nov 15 '14 at 18:15
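
A rough sketch of the `memoryview` suggestion from the comments, assuming Python 3 only (per the comment above, `mmap` does not support the `memoryview` protocol on 2.x). The function name `scan_bytes` is hypothetical. Unlike `bytearray(mmap_obj)`, which builds a separate in-memory copy, the view reads straight from the mapping and yields ints, so no `ord` call is needed.

```python
import mmap


def scan_bytes(path):
    """Yield each byte of the file as an int without copying the mapping.

    Assumes Python 3 and a non-empty file (mapping an empty file with
    length 0 raises ValueError).
    """
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
            # The view must be released before the mmap is closed,
            # which the nesting order of the with-blocks guarantees.
            with memoryview(mm) as view:
                for b in view:
                    yield b
```

Slicing the view (`view[offset:offset + size]`) would likewise give a new view rather than a copy, but such slices need to be released before the mapping is closed.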

score 1 · Answer 1 · edited May 23 '17 at 12:09

Converting one object to a mutable object does incur data copying. You can directly read the file to a bytearray by using:
```
import os

with open(FILENAME, 'rb') as f:
    data = bytearray(os.path.getsize(FILENAME))  # preallocate to the file size
    f.readinto(data)  # fill the buffer in place, no intermediate string
```

from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12

There is a string-to-bytearray conversion, so there is a potential performance issue.
bytearray is an array, so it can hit the limit of PY_SSIZE_T_MAX/sizeof(PyObject*). For more info, you can visit How Big can a Python Array Get?
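
If the whole file may not fit comfortably in memory, one possible variation on the `readinto` idea is to reuse a single fixed-size buffer and scan the file block by block. This is only a sketch under that assumption, not something taken from the linked article; `iter_blocks` and `block_size` are hypothetical names.

```python
def iter_blocks(path, block_size=64 * 1024):
    """Yield (file_offset, view) pairs using one reusable buffer.

    Each yielded view is only valid until the next iteration, because
    the underlying buffer is overwritten by the following read.
    """
    buf = bytearray(block_size)
    view = memoryview(buf)
    offset = 0
    with open(path, "rb") as f:
        while True:
            n = f.readinto(buf)  # number of bytes actually read
            if not n:
                break
            yield offset, view[:n]  # slicing a memoryview does not copy
            offset += n
```

A chunk that spans two blocks would still have to be assembled by the caller, so this trades the convenience of a single buffer for bounded memory use.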

score 0 · Answer 2 · answered Nov 15 '14 at 18:34

You could do this little hack.

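```
import mmap

class memmap(mmap.mmap):
    def read_byte(self):
        # Python 2 returns a length-1 str; Python 3 already returns an int.
        b = super(memmap, self).read_byte()
        return b if isinstance(b, int) else ord(b)
```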

This creates a class that inherits from `mmap.mmap` and overrides the default `read_byte`, which returns a string of length 1 on Python 2, with one that returns an int. You can then use this class like any other mmap object.
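
For illustration, a linear scan built on `read_byte` might look like the sketch below. The helper `iter_read_byte` is a hypothetical name; it accepts any mmap-like object, so it should work with the subclass above as well as with a plain `mmap.mmap`, and it normalizes the result in case `read_byte` returns a length-1 string.

```python
import mmap


def iter_read_byte(mm):
    """Yield the remaining bytes of an mmap-like object as ints."""
    # read_byte() raises ValueError at the end of the mapping,
    # so stop once the current position reaches its length.
    while mm.tell() < len(mm):
        b = mm.read_byte()
        yield b if isinstance(b, int) else ord(b)
```

Each step is a Python-level method call, so this is likely to be slower than iterating over a buffer directly.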

I hope this helps.
