Using mmap to apply regex to whole file

Question

I'm trying to apply a regular expression to a whole file (not just each line independently) using the following code:

import mmap, re

ifile = open(ifilename)
data = mmap.mmap(ifile.fileno(), 0)
print data
mo = re.search('error: (.*)', data)
if mo:
    print "found error"

This is based on the answer to the question How do I re.search or re.match on a whole file without reading it all into memory?

But I'm getting the following error:

Traceback (most recent call last):
  File "./myscript.py", line 29, in ?
    mo = re.search('error: (.*)', data)
  File "/usr/lib/python2.3/sre.py", line 137, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

How can I fix this problem?

In the question Match multiline regex in file object, I found that another possibility for reading the whole file is the following, instead of the mmap object:

data = open("data.txt").read()

Any reason to prefer the mmap rather than the simple buffer/string?

I am taking a guess here that it is probably because you are using Python 2.3 — Xin Yin, Aug 12 '14 at 16:57
Is this really all of your code? Python 2.7 throws an exception with "Permission Denied" if you open it without write permissions: `open(ifilename, 'r+')` — skrrgwasme, Aug 12 '14 at 21:31
Also, I don't think this is a version issue. The Python 2.3 [docs for the mmap module](https://docs.python.org/release/2.3/lib/module-mmap.html) specifically say that you can search through a m-mapped file with the re module. — skrrgwasme, Aug 12 '14 at 21:34
@ScottLawson Does using the `data=open("data.txt").read()` method effectively do the same thing as the `mmap` method does? (When they both work.) — synaptik, Aug 13 '14 at 01:26
@synaptik I'm hesitant to say yes, but... yes. They will both allow you to search the whole file at once with a regex. But they have **entirely different** methods of achieving that, so you really should carefully consider which one is best for your situation. Is there any reason you can't upgrade at least to Python 2.7, just in case this *is* a version issue? — skrrgwasme, Aug 13 '14 at 14:45
@ScottLawson I don't administer the servers where this code needs to run, however, I discovered that 2.6 is available. So I can now use that instead of 2.3. What is the gist of the difference in the methods? (Link that explains the difference?) — synaptik, Aug 13 '14 at 16:52

score 9 · Accepted Answer · edited May 23 '17 at 12:16

You really have two questions buried in here.

Your Technical Issue

Upgrading to a newer version of Python will most likely resolve the problem, or at least give you a better traceback. The mmap docs specify that you need to open a file for update to mmap it, and you're not currently doing that:

ifile = open(ifilename) # default is to open as read

Should be this:

ifile = open(ifilename, 'r+')

Or, if you can update to Python 2.6 as you mentioned in your comments, use a with statement so the file is closed automatically:

with open(ifilename, 'r+') as fi:
    # do stuff with open file

On Python 2.7, trying to mmap a file that was not opened with write permission raises a "Permission denied" error. I suspect that check was not implemented in 2.3, so you are allowed to continue with an invalid mmap object, which then fails when you try to search it with the regex.
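If you cannot or do not want to open the file for update, requesting a read-only mapping is another option worth trying. The following is a minimal sketch, not a tested fix for Python 2.3: it assumes your Python's mmap accepts the access keyword and mmap.ACCESS_READ, and the function name find_error is made up for illustration. The pattern is written as bytes because Python 3 requires that when searching an mmap; on Python 2 a plain string pattern works as well.

```python
import mmap
import re


def find_error(path):
    """Return the text after the first 'error: ' in the file, or None.

    find_error is a hypothetical helper name used only for this sketch.
    """
    # Plain read-only open; no 'r+' needed for a read-only mapping.
    with open(path, 'rb') as f:
        # Length 0 maps the whole file. ACCESS_READ asks for a read-only
        # mapping (assumption: your Python version supports this keyword).
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            mo = re.search(b'error: (.*)', data)
            # Copy the matched text out before the mapping is closed.
            return mo.group(1) if mo else None
        finally:
            data.close()
```

Note that mapping an empty file with length 0 raises an error, so check the file size first if that can happen in your case.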

mmap vs. open().read()

In the end, you will be able to do (almost) the same thing with both methods. re.search(pattern, mmap_or_long_string) will search either your memory mapped file or the long string that results from the read() call.

The main difference between the two methods is in Virtual vs Real Memory consumption. In a memory-mapped file, the file remains on disk (or wherever it is) and you directly access it through virtual memory addresses. When you read a file in using read(), you are bringing the whole file into (real) memory all at once.
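To make the comparison concrete, here is a small sketch that runs the same compiled pattern over a file both ways. The begin/end pattern and the function names search_read and search_mmap are assumptions chosen for illustration. The pattern deliberately spans a line break, which is the case that line-by-line matching would miss, and re.DOTALL lets `.` cross newlines.

```python
import mmap
import re

# Hypothetical pattern: capture everything between a 'begin' line and an
# 'end' line. DOTALL lets '.' match newlines so the match can span lines.
PATTERN = re.compile(b'begin\n(.*?)\nend', re.DOTALL)


def search_read(path):
    """Search after copying the whole file into memory with read()."""
    with open(path, 'rb') as f:
        data = f.read()
    mo = PATTERN.search(data)
    return mo.group(1) if mo else None


def search_mmap(path):
    """Search a read-only memory map of the file."""
    with open(path, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            mo = PATTERN.search(data)
            # Copy the group out before the mapping is closed.
            return mo.group(1) if mo else None
        finally:
            data.close()
```

Both functions should return the same captured text for the same file; only the way the bytes reach the regex engine differs.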

Why One or the Other:

File Size
The most significant limit on the size of the file you can map is the size of your virtual memory address space, which is dictated by your CPU (32 or 64 bit). The mapped address range must be contiguous, though, so you may get allocation errors if the OS can't find a large enough free block. With read(), on the other hand, the limit is the physical memory available. If you are accessing files larger than available memory and reading individual lines isn't an option, consider mmap.
File Sharing Among Processes
If you are parallelizing read-only operations on a large file, you can map it into memory to share it among processes instead of each process reading in a copy of the whole file.
Readability/Familiarity
Many more people are familiar with the simple open() and read() functions than memory mapping. Unless you have a compelling reason to use mmap, sticking with the basic IO functions is probably better in the long run for maintainability.
Speed
This one is a wash. A lot of forums and posts like to talk about mmap speed (because it bypasses some system calls once the file is mapped), but the underlying mechanism is still accessing a disk, while reading a whole file in brings everything into memory and only performs disk accesses at the beginning and end of working with the file. There is endless complexity if you try to account for caching (both hard disk and CPU), memory paging, and file access patterns. It is much easier to stick with the tried and true method of profiling. You will see different results based on your individual use case and access patterns for your files, so profile both and see which one is faster for you.
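For the profiling suggested under Speed, a rough sketch along these lines may be a starting point. The function names via_read, via_mmap and compare are made up for illustration, and the pattern is the one from the question written as bytes. Timings depend heavily on file size, OS caching and access pattern, so treat the numbers as relative for your own files only.

```python
import mmap
import re
import timeit

PATTERN = re.compile(b'error: (.*)')


def via_read(path):
    """Search after reading the whole file into memory."""
    with open(path, 'rb') as f:
        return PATTERN.search(f.read()) is not None


def via_mmap(path):
    """Search a read-only memory map of the file."""
    with open(path, 'rb') as f:
        data = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)
        try:
            return PATTERN.search(data) is not None
        finally:
            data.close()


def compare(path, repeats=5):
    """Return the best wall-clock time in seconds for each approach."""
    return {
        'read': min(timeit.repeat(lambda: via_read(path),
                                  number=1, repeat=repeats)),
        'mmap': min(timeit.repeat(lambda: via_mmap(path),
                                  number=1, repeat=repeats)),
    }
```

Calling compare('your_file.log') returns a dictionary with one timing per approach. After the first run the file is likely to be in the OS page cache, so later runs mostly measure the in-memory cost rather than disk access.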

Other Resources

A good summary of the differences
PyMOTW
A good SO question
Wikipedia Virtual Memory article

Actually the python docs are wrong w.r.t. "open for update". It depends on the OS and permissions on the mmap object — Antti Haapala -- Слава Україні, Jun 16 '17 at 11:40
