14

I have recently been working on a script that takes a file, chunks it, and analyzes each piece. Because the chunk boundaries depend on the content, I need to read it one byte at a time. I do not need random access; I just read linearly from beginning to end, selecting certain positions as I go and yielding the content of the chunk from the previously selected position to the current one.

It was very convenient to use a memory-mapped file wrapped in a bytearray. Instead of yielding the chunk, I yield the offset and size of the chunk, leaving the outer function to slice it.

It was also faster than accumulating the current chunk in a bytearray (and much faster than accumulating in bytes!). But I have certain concerns that I would like to address:

  1. Is bytearray copying the data?
  2. I open the file as rb and the mmap with access=mmap.ACCESS_READ. But bytearray is, in principle, a mutable container. Is this a performance problem? Is there a read only container that I should use?
  3. Because I do not accumulate in a buffer, the outer function slices the bytearray (and therefore the underlying file) at arbitrary offsets. Even though access might be buffered, I am afraid that there will be problems depending on the file size and system memory. Is this really a problem?
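For concreteness, the offset-and-size approach described above might look roughly like the following sketch. The `chunk_offsets` name and the `is_boundary` callback are hypothetical placeholders for illustration; the real boundary rule is not shown in the question.

```python
def chunk_offsets(data, is_boundary):
    """Yield (offset, size) pairs for content-defined chunks of `data`.

    `data` is anything that yields ints when iterated (a bytearray, or on
    Python 3 also bytes or a memoryview). `is_boundary` is a hypothetical
    callback: it takes one byte value and returns True if a chunk ends there.
    """
    start = 0
    pos = -1
    for pos, byte in enumerate(data):
        if is_boundary(byte):
            yield start, pos + 1 - start
            start = pos + 1
    if start <= pos:  # trailing bytes after the last boundary
        yield start, pos + 1 - start


data = bytearray(b"ab\ncd\nef")
chunks = []
for offset, size in chunk_offsets(data, lambda b: b == 0x0A):
    # The outer code does the slicing, as described in the question.
    chunks.append(bytes(data[offset:offset + size]))
```

Here a newline byte stands in for the content-dependent boundary test, and only the caller ever slices the buffer.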
Hernan
  • 1
    Are you able to read the sources? https://www.python.org/downloads/source/ it is in the Objects folder. – User Nov 02 '14 at 06:26
  • @User Thanks for the tip. So bytearray is copying the data. I am using bytearray to avoid calling `ord` in each step of the loop. In a way, what I would need is something like `numpy.memmap(, dtype='uint8', mode='r')`, which allows me to iterate through the bytes (in their integer representation) – Hernan Nov 02 '14 at 15:08
  • 1
    What Python version? Also, how are you wrapping `mmap` in your `bytearray`? – Veedrac Nov 02 '14 at 16:46
  • 1
    @Veedrac I am targeting 2.7 and 3.4. Right now, I am just doing bytearray(mmap()) – Hernan Nov 02 '14 at 22:46
  • Ah, I'm pretty sure that the call copies the contents of the `mmap`. `bytearray` can't "wrap" `mmap` like that. If you want to support both 2.7 and 3.4 you'll probably be better off using Numpy (eg. [`numpy.memmap`](http://docs.scipy.org/doc/numpy/reference/generated/numpy.memmap.html)). This supports copy-free slicing. I've not used it but I would note the "An alternative to using this" section. – Veedrac Nov 03 '14 at 01:43
  • @Veedrac I have used memmap and it works nicely, but I would like something that does not require numpy. Will keep looking, thanks a lot! – Hernan Nov 03 '14 at 02:43
  • 1
    On Python 3 you can use `memoryview` but `mmap` doesn't support the `memoryview` protocol on 2.x. – Veedrac Nov 03 '14 at 02:48
  • For the first question, I assume bytearray doesn't copy the data. A bytearray contains numbers between 0 and 255. The current implementation **keeps an array of integer objects for all integers between -5 and 256**; when you create an int **in that range you actually just get back a reference to the existing object**. [Stack Overflow question reference](http://stackoverflow.com/questions/3402679/identifying-objects-why-does-the-returned-value-from-id-change) – Tal Nov 10 '14 at 08:47
  • 2
    Although your explanation is pretty detailed, a piece of code would be very helpful. – Alex Nov 15 '14 at 18:15

2 Answers

1
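  1. Yes. Constructing a bytearray from another object (including an mmap) copies the data. You can instead read the file directly into a preallocated bytearray:

    import os

    with open(FILENAME, 'rb') as f:
        data = bytearray(os.path.getsize(FILENAME))
        f.readinto(data)  # fills the existing buffer, no intermediate bytes object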
    

from http://eli.thegreenplace.net/2011/11/28/less-copies-in-python-with-the-buffer-protocol-and-memoryviews#id12
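As a rough illustration of the copy question, the sketch below contrasts `bytearray(mm)` with `memoryview(mm)`. It assumes Python 3, since (as noted in the comments) `mmap` does not support `memoryview` on 2.x; the scratch file and variable names are made up for the example.

```python
import mmap
import os
import tempfile

# Write a small scratch file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.write(fd, b"hello world")
os.close(fd)

with open(path, "rb") as f:
    mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    copied = bytearray(mm)    # allocates a new buffer and copies every byte
    view = memoryview(mm)     # Python 3 only: refers to the mapping, no copy
    first = view[0]           # indexing a byte view gives an int
    window = view[6:11]       # slicing a memoryview does not copy either
    word = bytes(window)      # copy only the slice that is actually needed
    readonly = view.readonly  # reflects ACCESS_READ
    # Release the views before closing the map; closing an mmap with
    # exported buffers raises BufferError.
    window.release()
    view.release()
    mm.close()

os.remove(path)
```

A read-only `memoryview` also speaks to the mutability concern: the view reports `readonly` as true for a map opened with `ACCESS_READ`, and iterating or indexing it yields ints without `ord`.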

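  2. There is a string-to-bytearray conversion, so there is a potential performance issue.

  3. A bytearray is a contiguous in-memory array, so its size is limited: at most PY_SSIZE_T_MAX bytes (the PY_SSIZE_T_MAX/sizeof(PyObject*) figure applies to lists, which store object pointers), and in practice by available memory. For more info, see "How Big can a Python Array Get?"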
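If the whole-file buffer is the worry, one possible alternative is to reuse a fixed-size bytearray with `readinto`, so memory use stays bounded regardless of file size. This is only a sketch: `iter_bytes` and `block_size` are hypothetical names, and the caller would have to keep (or re-read) the bytes since the last boundary in order to slice chunks.

```python
import io


def iter_bytes(f, block_size=64 * 1024):
    """Yield (absolute_offset, byte_value) pairs from a binary file object."""
    buf = bytearray(block_size)  # reused for every read, so memory stays bounded
    base = 0
    while True:
        n = f.readinto(buf)      # number of bytes actually read; 0 at EOF
        if not n:
            break
        for i in range(n):
            yield base + i, buf[i]  # indexing a bytearray gives an int
        base += n


# Usage with an in-memory stream standing in for a real file.
pairs = list(iter_bytes(io.BytesIO(b"abc"), block_size=2))
```

The trade-off is that offsets are produced without a sliceable buffer behind them, which is exactly what the mmap approach provides for free.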

snowblade
0

You could do this little hack.

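import mmap

class memmap(mmap.mmap):
    def read_byte(self):
        # Python 2 returns a 1-character str; Python 3 already returns an int.
        b = super(memmap, self).read_byte()
        return b if isinstance(b, int) else ord(b)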

Create a class that inherits from the mmap class and overrides the default read_byte, which returns a string of length 1 on Python 2, with one that returns an int (on Python 3, read_byte already returns an int). You can then use this class like any other mmap object.
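A minimal usage sketch follows, assuming a small scratch file; the class is repeated under the hypothetical name `IntMmap` so the example runs on its own on both 2.7 and 3.x.

```python
import mmap
import os
import tempfile


class IntMmap(mmap.mmap):
    """mmap whose read_byte() always returns an int."""

    def read_byte(self):
        b = super(IntMmap, self).read_byte()
        # Python 2 gives a 1-character str, Python 3 gives an int.
        return b if isinstance(b, int) else ord(b)


# Scratch file so the sketch is self-contained.
fd, path = tempfile.mkstemp()
os.write(fd, b"abc")
os.close(fd)

values = []
with open(path, "rb") as f:
    mm = IntMmap(f.fileno(), 0, access=mmap.ACCESS_READ)
    for _ in range(len(mm)):  # read_byte() raises ValueError past the end
        values.append(mm.read_byte())
    mm.close()

os.remove(path)
```

Note that this still makes one Python-level method call per byte, so it removes the explicit `ord` from the loop body rather than the per-byte overhead.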

I hope this helps.

Alex