49

[Edit: This problem applies only to 32-bit systems. If your computer, your OS and your Python implementation are 64-bit, then mmap-ing huge files works reliably and is extremely efficient.]

I am writing a module that amongst other things allows bitwise read access to files. The files can potentially be large (hundreds of GB) so I wrote a simple class that lets me treat the file like a string and hides all the seeking and reading.

At the time I wrote my wrapper class I didn't know about the mmap module. On reading the documentation for mmap I thought "great - this is just what I needed, I'll take out my code and replace it with an mmap. It's probably much more efficient and it's always good to delete code."

The problem is that mmap doesn't work for large files! This is very surprising to me as I thought it was perhaps the most obvious application. If the file is above a few gigabytes then I get an EnvironmentError: [Errno 12] Cannot allocate memory. This only happens with a 32-bit Python build so it seems it is running out of address space, but I can't find any documentation on this.

My code is just

import mmap

f = open('somelargefile', 'rb')
map = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

So my question is: am I missing something obvious here? Is there a way to get mmap to work portably on large files, or should I go back to my naïve file wrapper?


Update: There seems to be a feeling that the Python mmap should have the same restrictions as the POSIX mmap. To better express my frustration here is a simple class that has a small part of the functionality of mmap.

import os

class Mmap(object):
    def __init__(self, f):
        """Initialise with a file object."""
        self.source = f

    def __getitem__(self, key):
        try:
            # A slice
            self.source.seek(key.start, os.SEEK_SET)
            return self.source.read(key.stop - key.start)
        except AttributeError:
            # single element
            self.source.seek(key, os.SEEK_SET)
            return self.source.read(1)

It's read-only and doesn't do anything fancy, but I can do this just the same as with an mmap:

map2 = Mmap(f)
print(map2[0:10])
print(map2[10000000000:10000000010])

except that there are no restrictions on filesize. Not too difficult really...

Yaakov Belch
Scott Griffiths
  • But it doesn't have the functionality of mmap. mmap exposes a buffer interface, and you can do regexp matching against it. mmap supports writing to the file, and mmap supports shared memory. Your code, and even your approach, won't do that. – Andrew Dalke Nov 05 '09 at 17:12
  • 2
    Well it has a *small* amount of mmap's functionality but without suffering from the address space limitation. It's only a toy piece of code - I'm not claiming it's a replacement! I don't see a problem with this approach imitating the functionality of mmap, although I can understand it can't match the performance. – Scott Griffiths Nov 05 '09 at 17:42
  • 3
    Because it *can't* implement the functionality of mmap. How would you implement IPC with this, so a child process can communicate with the parent through a shared memory block? Also, your example is not thread-safe, since two __getitem__ calls in different threads can happen such that the seek for the second occurs immediately after the seek for the first, causing the read for the first to give the wrong result. – Andrew Dalke Nov 06 '09 at 01:16
  • 1
    @dalke: OK, I give in! As I've amply demonstrated I don't know a lot about the POSIX mmap. I only need a subset of the functionality (no threading etc.) which I can do fairly simply. I'll take your word for it about the rest :) – Scott Griffiths Nov 06 '09 at 12:54
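
The thread-safety point raised in the comments can be illustrated with a small sketch. This is not the asker's code: `LockedFileView` is a hypothetical name, and the sketch assumes slices with a non-negative start, an explicit stop and no step. It holds a lock around each seek/read pair so that two threads indexing the same view cannot interleave them. It only protects reads made through the view itself; other code that seeks the same file object directly is not covered.

```python
import threading


class LockedFileView(object):
    """Read-only, string-like view of a file (hypothetical illustration)."""

    def __init__(self, f):
        self.source = f
        self._lock = threading.Lock()

    def __getitem__(self, key):
        if isinstance(key, slice):
            # Assumption: explicit stop, no step, non-negative bounds.
            if key.stop is None:
                raise ValueError("slice needs an explicit stop")
            start = key.start or 0
            count = max(0, key.stop - start)
        else:
            start, count = key, 1
        # Seek and read must happen as one unit, otherwise another
        # thread could move the file position in between.
        with self._lock:
            self.source.seek(start)
            return self.source.read(count)
```

The file should be opened in binary mode (`'rb'`), as in the question, so that offsets are byte offsets.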

8 Answers

39

From IEEE 1003.1:

The mmap() function shall establish a mapping between a process' address space and a file, shared memory object, or [TYM] typed memory object.

It needs all the virtual address space because that's exactly what mmap() does.

The fact that it isn't really running out of memory doesn't matter - you can't map more address space than you have available. Since you then take the result and access as if it were memory, how exactly do you propose to access more than 2^32 bytes into the file? Even if mmap() didn't fail, you could still only read the first 4GB before you ran out of space in a 32-bit address space. You can, of course, mmap() a sliding 32-bit window over the file, but that won't necessarily net you any benefit unless you can optimize your access pattern such that you limit how many times you have to visit previous windows.

Nick Bastin
  • But that's the POSIX mmap. The IEEE isn't relevant. The Python module of the same name doesn't have to operate in the same way, and I can't see any documentation that says that it does. Perhaps I should add some code to my question to clarify... – Scott Griffiths Nov 02 '09 at 17:06
  • 20
    The POSIX mmap spec is *absolutely* relevant. The whole point of the Python mmap module is to give you direct access to the operating system's mmap, allowing hw pointer access to file data as if it were memory. If you want more convenience, use the many other IO-related modules in the Python library or any other language. Otherwise you need to live with the constraints of the underlying OS and CPU virtual memory architecture. – Ned Deily Nov 02 '09 at 17:13
  • So what does Python's mmap do on Windows? I don't mean to be dense, but the documentation for Python's mmap doesn't mention POSIX or direct access to the operating system so I consider that to be an implementation detail. (But you're right in that it's likely to be one I have to live with:) – Scott Griffiths Nov 02 '09 at 17:25
  • 2
    Windows implements POSIX api calls. POSIX mmap does the same thing on Windows as on Linux: it maps the file into the virtual address space. – mch Nov 02 '09 at 17:46
  • 2
    If you haven't already, read http://en.wikipedia.org/wiki/Mmap and note the note about Windows MapViewOfFile; looking at the code for the python Modules/mmapmodule.c, that's what it uses on Windows. BTW, suggestions for improving the Python documentation are always welcome at bugs.python.org. – Ned Deily Nov 02 '09 at 17:52
  • 1
    On Windows, Python wraps mmap on top of the MapViewOfFile Win32 call, which operates very similarly to *nix mmap. The documentation has a few notes on the differences between Unix and Windows regarding mmap. mmap is part of Python's "Optional Operating System Services", whose whole point is to wrap common operating system features, and which are thus subject to the restrictions of the underlying OS. – nos Nov 02 '09 at 17:52
  • 1
    Thanks guys, I guess much of the problem is the Python documentation not being explicit enough. – Scott Griffiths Nov 02 '09 at 18:01
18

Sorry to answer my own question, but I think the real problem I had was not realising that mmap was a standard POSIX system call with particular characteristics and limitations, and that the Python mmap is just supposed to expose its functionality.

The Python documentation doesn't mention the POSIX mmap and so if you come at it as a Python programmer without much knowledge of POSIX (as I did) then the address space problem appears quite arbitrary and badly designed!

Thanks to the other posters for teaching me the true meaning of mmap. Unfortunately no one has suggested a better alternative to my hand-crafted class for treating large files as strings, so I shall have to stick with it for now. Perhaps I will clean it up and make it part of my module's public interface when I get the chance.

Scott Griffiths
  • 9
    Seems to me that your hand-crafted class is a good fit for your needs. There is no compulsion to use unsuitable mechanisms just because they are part of the environment. Thanks for sharing the learning experience. You've saved me from re-inventing the same set of problems. – CyberFonic Feb 25 '10 at 01:28
17

A 32-bit program on a 32-bit operating system uses 32-bit addresses, so it can address at most 2^32 bytes, i.e. 4GB. There are other factors that make the usable total even smaller; for example, Windows reserves between 0.5 and 2GB for hardware access, and of course your program is going to take some space as well.

Edit: The obvious thing you're missing is an understanding of the mechanics of mmap, on any operating system. It allows you to map a portion of a file to a range of memory - once you've done that, any access to that portion of the file happens with the least possible overhead. It's low overhead because the mapping is done once, and doesn't have to change every time you access a different range. The drawback is that you need an open address range sufficient for the portion you're trying to map. If you're mapping the whole file at once, you'll need a hole in the memory map large enough to fit the entire file. If no such hole exists, or the file is bigger than your entire address space, the mapping fails.

Mark Ransom
  • True, but mmap doesn't actually *need* to address all of this memory - the address space limitation is an implementation detail. Sure, if I ask for a huge slice then there may be memory problems but otherwise there's no need to reserve the memory. – Scott Griffiths Nov 02 '09 at 16:19
  • "If I ask for a huge slice" - since you used 0 for the second parameter, your "slice" is the whole file. – Mark Ransom Nov 02 '09 at 17:14
  • Yes I'm asking for the whole file, but I'm not expecting it to be read into memory unless I reference a slice of it. – Scott Griffiths Nov 02 '09 at 17:17
  • 4
    A typical mmap implementation will reserve the address space of the object you're mapping. If that mapping cannot be made - e.g. as in there's not enough space to map the requested size, mmap will fail. mmap won't actually read the entire thing until you access it. But it will try to create the address space mapping. – nos Nov 02 '09 at 17:40
9

The mmap module provides all the tools you need to poke around in your large file, but due to the limitations other folks have mentioned, you can't map it all at once. You can map a good-sized chunk at once, do some processing, and then unmap that and map another. The key arguments to the mmap class are length and offset, which do exactly what they sound like, allowing you to map length bytes, starting at byte offset in the mapped file. Any time you wish to read a section of memory that is outside the mapped window, you have to map in a new window.
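
As a rough sketch of that map/process/unmap cycle (the helper name `read_range` and the 64 MiB default window are assumptions for illustration, not part of the mmap module), the function below reads an arbitrary byte range by mapping one window at a time. The one detail that is easy to miss is that the `offset` passed to `mmap.mmap` must be a multiple of `mmap.ALLOCATIONGRANULARITY`, so each window starts at the aligned position at or below the byte you want.

```python
import mmap
import os


def read_range(path, start, length, window=64 * 1024 * 1024):
    """Return up to `length` bytes starting at byte `start` of `path`,
    mapping at most `window` bytes of the file at any one time."""
    gran = mmap.ALLOCATIONGRANULARITY
    # A window smaller than the granularity could fail to reach `pos`.
    window = max(window, gran)
    chunks = []
    with open(path, "rb") as f:
        size = os.fstat(f.fileno()).st_size
        end = min(start + length, size)
        pos = start
        while pos < end:
            # Round down so the mapping offset is correctly aligned.
            base = (pos // gran) * gran
            map_len = min(window, size - base)
            m = mmap.mmap(f.fileno(), map_len,
                          access=mmap.ACCESS_READ, offset=base)
            try:
                lo = pos - base
                hi = min(end - base, map_len)
                chunks.append(m[lo:hi])
                pos = base + hi
            finally:
                m.close()  # unmap before mapping the next window
    return b"".join(chunks)
```

For example, `read_range('somelargefile', 10000000000, 10)` would return the same ten bytes as `map2[10000000000:10000000010]` in the question. Whether this beats plain seek/read depends on the access pattern; remapping for every small read is unlikely to be a win.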

SingleNegationElimination
6

The point you are missing is that mmap is a memory mapping function that maps a file into memory for arbitrary access across the requested data range by any means.

What you are looking for sounds more like some sort of data window class that presents an API allowing you to look at small windows of a large data structure at any one time. Access beyond the bounds of this window would not be possible other than by calling the data window's own API.

This is fine, but it is not a memory map; it is something that offers the advantage of a wider data range at the cost of a more restrictive API.

morechilli
4

Use a 64-bit computer, with a 64-bit OS and a 64-bit Python implementation, or avoid mmap().

mmap() requires CPU hardware support to make sense with large files bigger than a few GiB.

It uses the CPU's MMU and interrupt subsystems to expose the data as if it were already loaded into RAM.

The MMU is hardware which will generate an interrupt whenever an address corresponding to data not in physical RAM is accessed, and the OS will handle the interrupt in a way that makes sense at runtime, so the accessing code never knows (or needs to know) that the data doesn't fit in RAM.

This makes your accessing code simple to write. However, to use mmap() this way, everything involved will need to handle 64 bit addresses.

Or else it may be preferable to avoid mmap() altogether and do your own memory management.
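
One way to decide between the two at runtime is to check how wide a pointer is in the running interpreter. The sketch below is only a heuristic under stated assumptions: `pointer_bits` and `can_map_whole_file` are hypothetical helper names, and comparing against `sys.maxsize` is a deliberately conservative bound. Passing the check does not guarantee that a mapping will succeed, since the address space may be fragmented or partly in use.

```python
import struct
import sys


def pointer_bits():
    """Width of a native pointer in this Python build, in bits."""
    return struct.calcsize("P") * 8


def can_map_whole_file(file_size):
    """Rough check: could a file of this size fit in the address space
    at all? sys.maxsize is 2**31 - 1 on a 32-bit build."""
    return file_size < sys.maxsize
```

A caller might use `os.path.getsize(path)` with `can_map_whole_file` and fall back to a seek/read wrapper, like the one in the question, when it returns False.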

runemoennike
RGD2
2

You're setting the length parameter to zero, which means map in the entire file. On a 32-bit build, this won't be possible if the file length is more than 2GB (possibly 4GB).

R Hyde
  • Yes, I want to map the whole file. It seems unreasonable to restrict it to a few GB especially as I need read-only access. It seems crazy to me that mmap immediately tries to reserve GBs of memory! – Scott Griffiths Nov 02 '09 at 15:51
  • 7
    mmap'ing doesn't require physical memory - it needs *virtual address space* to make the file available. – nobody Nov 02 '09 at 16:14
  • @Andrew: Then I suppose my question is Why does it need all this virtual address space? It's easy enough to make the file behave like a string without it (especially if it's read only). Perhaps I should stress that this is about the Python mmap module, which doesn't have to have the same characteristics and restrictions as the Unix mmap system call. – Scott Griffiths Nov 02 '09 at 16:39
  • 4
    Because a pointer to a virtual address is STILL only 32-bits. 32-bits = 4GB at most. Python uses the local architecture's pointers. – jmucchiello Nov 02 '09 at 16:58
1

You ask the OS to map the entire file in a memory range. It won't be read until you trigger page faults by reading/writing, but it still needs to make sure the entire range is available to your process, and if that range is too big, there will be difficulties.

Macke