20

I have a Python script which reads a file line by line and checks whether each line matches a regular expression.

I would like to improve the performance of that script by memory-mapping the file before I search. I have looked into the mmap example: http://docs.python.org/2/library/mmap.html

My question is: how can I mmap a file when it is too big (15 GB) for the memory of my machine (4 GB)?

I read the file like this:

fi = open(log_file, 'r', buffering=10*1024*1024)

for line in fi:
    # do something

fi.close()

Since I set the buffer to 10 MB, in terms of performance, is it the same as mmapping 10 MB of the file at a time?

Thank you.

michael
  • 106,540
  • 116
  • 246
  • 346
  • 2
    make sure the IO is the performance bottleneck and not the regex search or some other operation in your program. – jfs Jan 12 '13 at 02:19
  • @J.F.Sebastian: +1. Doing a Python loop with an `re.search` (and an implicit `find('\n')` equivalent) once per 80 characters, vs. once per 10MB, may be enough CPU work to dwarf the IO costs and make the rest of the question irrelevant. I updated my answer to take that into account. – abarnert Jan 12 '13 at 02:38

2 Answers

41

First, the memory of your machine is irrelevant. It's the size of your process's address space that's relevant. With a 32-bit Python, this will be somewhere under 4GB. With a 64-bit Python, it will be more than enough.

The reason for this is that mmap isn't about mapping a file into physical memory, but into virtual memory. An mmapped file becomes just like a special swap file for your program. Thinking about this can get a bit complicated, but the Wikipedia links above should help.

So, the first answer is "use a 64-bit Python". But obviously that may not be applicable in your case.

The obvious alternative is to map in the first 1GB, search that, unmap it, map in the next 1GB, etc. You do this by passing the length and offset parameters to the mmap constructor (note that offset must be a multiple of mmap.ALLOCATIONGRANULARITY). For example:

m = mmap.mmap(f.fileno(), length=1024*1024*1024, offset=1536*1024*1024)

However, the regex you're searching for could be found half-way in the first 1GB, and half in the second. So, you need to use windowing—map in the first 1GB, search, unmap, then map in a partially-overlapping 1GB, etc.

The question is, how much overlap do you need? If you know the maximum possible size of a match, you don't need anything more than that. And if you don't know… well, then there is no way to actually solve the problem without breaking up your regex—if that isn't obvious, imagine how you could possibly find a 2GB match in a single 1GB window.

Answering your followup question:

Since I set the buffer to 10MB, in terms of performance, is it the same as I mmap 10MB of file?

As with any performance question, if it really matters, you need to test it, and if it doesn't, don't worry about it.

If you want me to guess: I think mmap may be faster here, but only because (as J.F. Sebastian implied) looping and calling re.match 128K times as often may cause your code to be CPU-bound instead of IO-bound. But you could optimize that away without mmap, just by using read. So, would mmap be faster than read? Given the sizes involved, I'd expect the performance of mmap to be much faster on old Unix platforms, about the same on modern Unix platforms, and a bit slower on Windows. (You can still get large performance benefits out of mmap over read or read+lseek if you're using madvise, but that's not relevant here.) But really, that's just a guess.

The most compelling reason to use mmap is usually that it's simpler than read-based code, not that it's faster. When you have to use windowing even with mmap, and when you don't need to do any seeking with read, this is less compelling, but still, if you try writing the code both ways, I'd expect your mmap code would end up a bit more readable. (Especially if you tried to optimize out the buffer copies from the obvious read solution.)

abarnert
  • 354,177
  • 51
  • 601
  • 671
  • 1
    Good answer, but IMHO could have been even better if you'd explained _why_ it is address space rather than amount of physical memory in the machine that matters. – martineau Jan 12 '13 at 02:53
  • @martineau: Good idea. I tried, but… it's tricky to explain this stuff, especially if you don't know how much the audience already knows. If the last paragraph is an unreadable mess, I wouldn't be too surprised. Any suggestions for improvement? – abarnert Jan 12 '13 at 20:16
  • What you added seems a little too detailed. I guess it could be summed up into it's address space that matters more because that's how much memory your computer can access without using some trickery, and this already has to hold your operating system, running programs, and non-memory-mapped data. If you memory-map a file, the whole thing will need to fit into what's free within that space. On a 32-bit system there's just no way to fit a 15 GB file into the < 4 GB address space available, so directly memory-mapping the while file is simply impossible. – martineau Jan 12 '13 at 21:02
  • @martineau: Um… what you just summed up is wrong. I'm not sure whether you're talking about physical memory, or about master page table space, but neither of those is relevant here. The address space is how much memory _each process_ can access, not the computer as a whole. Other running programs and their data are irrelevant. And so are 32- vs. 64-bit systems—if you run a 32-bit process on a 64-bit system, it still only has a 4GB address space. – abarnert Jan 12 '13 at 21:27
  • My example was about the 4 GB [virtual address space](http://en.wikipedia.org/wiki/Virtual_address_space#Example) available to processes running under a 32-bit OS. I consider things like the page table part of the OS. I may have said "computer" instead of the more-correct term "process". The illustrations in the linked article show what I was trying to describe -- namely that everything, including [memory-mapped](http://en.wikipedia.org/wiki/Memory-mapped_file) files, would have to fit within 4 GB in such a scenario. – martineau Jan 13 '13 at 01:02
  • @martineau: First, each process (that is, each program) has its own 4GB virtual address space. This isn't just a minor technicality—you have dozens of programs running on your computer, and each one has 4GB, so the idea that "other running programs" have to fit into your program's space is wrong. Second, the 32-bit OS doesn't matter. Many people run 32-bit Python on 64-bit Windows (and some on Linux; very few on OS X nowadays, but it's still possible). So, despite having a 64-bit OS, they still have the same 4GB limit. – abarnert Jan 13 '13 at 04:34
  • @martineau: … but the link does look like it'll be pretty useful to others trying to understand. I added a couple more words and a lot more links to the "short version" near the top, and scrapped my attempt to explain all the details at the bottom (because I was clearly making a mess of it). – abarnert Jan 13 '13 at 04:39
  • you're assuming that python 64bit can mmap() more than 4G. Are you sure about that? (answer: no it can't, it's also limited to 4GB file size!) – Eric May 28 '20 at 13:27
1

I came to try mmap because I was using fileh.readline() on a file dozens of GB in size and wanted to make it faster. The Unix strace utility seems to reveal that the file is read in 4 kB chunks, and at least the output from strace seems to be printed slowly; I know parsing the file takes many hours.

$ strace -v -f -p 32495
Process 32495 attached
read(5, "blah blah blah foo bar xxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
read(5, "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"..., 4096) = 4096
^CProcess 32495 detached
$

This thread is so far the only one explaining to me that I should not try to mmap a file that is too large. I do not understand why there is not already a helper function like mmap_for_dummies(filename) which would internally do os.path.getsize(filename) and then either do a normal open(filename, 'r', buffering=10*1024*1024) or do mmap.mmap(open(filename).fileno()). I certainly want to avoid fiddling with the sliding-window approach myself, but a function making the simple decision of whether to mmap or not would be enough for me.

Finally, to mention, it is still not clear to me why some examples on the internet use open(filename, 'rb') without explanation (e.g. https://docs.python.org/2/library/mmap.html). Given that one often wants to use the file in a for loop with a .readline() call, I do not know whether I should open in 'rb' or just 'r' mode (I guess 'rb' is necessary to preserve the '\n').

Thanks for mentioning the buffering=10*1024*1024 argument; it is probably more helpful than changing my code would be for gaining some speed.

Martin
  • 79
  • 1
  • 3