
I'm trying to apply a regular expression to a whole file (not just each line independently) using the following code:

import mmap, re

ifile = open(ifilename)
data = mmap.mmap(ifile.fileno(), 0)
print data
mo = re.search('error: (.*)', data)
if mo:
    print "found error"

This is based on the answer to the question "How do I re.search or re.match on a whole file without reading it all into memory?"

But I'm getting the following error:

Traceback (most recent call last):
  File "./myscript.py", line 29, in ?
    mo = re.search('error: (.*)', data)
  File "/usr/lib/python2.3/sre.py", line 137, in search
    return _compile(pattern, flags).search(string)
TypeError: expected string or buffer

How can I fix this problem?


In the question "Match multiline regex in file object", I found another way to read the whole file, using the following instead of the mmap object:

data = open("data.txt").read()

Is there any reason to prefer mmap over the simple buffer/string?

martineau
synaptik
  • I am taking a guess here that it is probably because you are using Python 2.3 – Xin Yin Aug 12 '14 at 16:57
  • Is this really all of your code? On Python 2.7, calling mmap on a file opened without write permission raises a "Permission denied" exception; the file has to be opened with `open(ifilename, 'r+')`. – skrrgwasme Aug 12 '14 at 21:31
  • Also, I don't think this is a version issue. The Python 2.3 [docs for the mmap module](https://docs.python.org/release/2.3/lib/module-mmap.html) specifically say that you can search through a memory-mapped file with the re module. – skrrgwasme Aug 12 '14 at 21:34
  • @ScottLawson Does using the `data=open("data.txt").read()` method effectively do the same thing as the `mmap` method does? (When they both work.) – synaptik Aug 13 '14 at 01:26
  • @synaptik I'm hesitant to say yes, but... yes. They will both allow you to search the whole file at once with a regex. But they have **entirely different** methods of achieving that, so you really should carefully consider which one is best for your situation. Is there any reason you can't upgrade at least to Python 2.7, just in case this *is* a version issue? – skrrgwasme Aug 13 '14 at 14:45
  • @ScottLawson I don't administer the servers where this code needs to run, however, I discovered that 2.6 is available. So I can now use that instead of 2.3. What is the gist of the difference in the methods? (Link that explains the difference?) – synaptik Aug 13 '14 at 16:52

1 Answer


You really have two questions buried in here.

Your Technical Issue

Upgrading to a newer version of Python will most likely resolve the problem, or at least give you a clearer traceback. The mmap docs specify that a file must be opened for update before it is mapped, and your code does not do that.

ifile = open(ifilename)  # default mode is read-only

Should be this:

ifile = open(ifilename, 'r+')

Or, if you can move to Python 2.6 as you mentioned in your comments, use a with statement so the file is closed for you:

with open(ifilename, 'r+') as fi:
    # do stuff with open file

On 2.7, mapping a file that was not opened with write permission raises a "Permission denied" exception. I suspect that check was not implemented in 2.3, so you are allowed to continue with an invalid mmap object that then fails when you search it with the regex.
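
If changing the open mode is not an option, a read-only mapping is another route. The sketch below is written in Python 3 syntax and assumes a Python version where mmap accepts the access keyword; I have not tried it on 2.3. The helper name find_error and the sample log contents are made up for illustration.

```python
import mmap
import os
import re
import tempfile


def find_error(path):
    """Return the text captured after 'error: ' in the file at path, or None."""
    # Read-only binary mode is enough because the mapping is read-only too.
    with open(path, "rb") as f:
        # Length 0 maps the whole file; ACCESS_READ needs no write permission.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # In Python 3 a mmap is bytes-like, so the pattern must be bytes.
            mo = re.search(rb"error: (.*)", data)
            return mo.group(1).decode() if mo else None


# Hypothetical sample log, written in binary mode to a temporary file.
with tempfile.NamedTemporaryFile("wb", suffix=".log", delete=False) as tmp:
    tmp.write(b"step 1 ok\nerror: disk full\nstep 3 ok\n")
    sample_path = tmp.name

result = find_error(sample_path)
os.remove(sample_path)
print(result)
```

Note that mapping an empty file raises an error, so real code would probably want to check the file size first.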

mmap vs. open().read()

In the end, you will be able to do (almost) the same thing with both methods. re.search(pattern, mmap_or_long_string) will search either your memory mapped file or the long string that results from the read() call.
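
As a rough illustration of that equivalence, here is a sketch in Python 3 syntax that runs one multi-line pattern against both a read() result and a read-only mapping. The file contents and the pattern are invented for the example. One place where the "almost" shows up in Python 3: a mmap is bytes-like, so the pattern has to be a bytes pattern, whereas a file read in text mode would need a str pattern.

```python
import mmap
import os
import re
import tempfile

# Hypothetical sample content; the match spans a line break on purpose.
content = b"begin\nerror: first line\nsecond line\nend\n"
pattern = rb"error: (.*?)\nend"

with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(content)
    path = tmp.name

# Approach 1: read the whole file into one bytes object and search it.
with open(path, "rb") as f:
    mo = re.search(pattern, f.read(), re.DOTALL)
    read_text = mo.group(1) if mo else None

# Approach 2: map the file read-only and search the mapping directly.
with open(path, "rb") as f:
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mapped:
        mo = re.search(pattern, mapped, re.DOTALL)
        # group() copies the matched bytes out, so they outlive the mapping.
        mmap_text = mo.group(1) if mo else None

os.remove(path)
print(read_text == mmap_text)
```

re.DOTALL is what lets `.` cross line boundaries here; without it, neither approach would match across the newline.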

The main difference between the two methods is in virtual vs. real memory consumption. In a memory-mapped file, the file remains on disk (or wherever it is) and you access it directly through virtual memory addresses. When you read a file with read(), you bring the whole file into (real) memory all at once.

Why One or the Other:

  1. File Size
    The most significant limit on the size of the file you can map is the size of your virtual address space, which is dictated by your CPU (32 or 64 bit). The mapping needs a contiguous range of addresses, though, so it can fail if the OS cannot find a large enough free block. With read(), on the other hand, the limit is the physical memory available. If you are accessing files larger than available memory and reading individual lines isn't an option, consider mmap.

  2. File Sharing Among Processes
    If you are parallelizing read-only operations on a large file, you can map it into memory to share it among processes instead of each process reading in a copy of the whole file.

  3. Readability/Familiarity
    Many more people are familiar with the simple open() and read() functions than memory mapping. Unless you have a compelling reason to use mmap, sticking with the basic IO functions is probably better in the long run for maintainability.

  4. Speed
    This one is a wash. A lot of forums and posts like to talk about mmap speed (because it bypasses some system calls once the file is mapped), but the underlying mechanism is still accessing a disk, while reading a whole file in brings everything into memory and only performs disk accesses at the beginning and end of working with the file. There is endless complexity if you try to account for caching (both hard disk and CPU), memory paging, and file access patterns. It is much easier to stick with the tried and true method of profiling. You will see different results based on your individual use case and access patterns for your files, so profile both and see which one is faster for you.
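
For the profiling suggested in the speed item, something along these lines could be a starting point. It is a minimal sketch in Python 3 syntax, not a rigorous benchmark: the function names, the generated file, and the run count of 5 are all arbitrary choices for illustration, and the OS file cache will influence the numbers.

```python
import mmap
import os
import re
import tempfile
import timeit

PATTERN = re.compile(rb"error: (.*)")  # the question's pattern, as bytes


def search_with_read(path):
    """Load the whole file into memory, then search it."""
    with open(path, "rb") as f:
        mo = PATTERN.search(f.read())
    return mo.group(1) if mo else None


def search_with_mmap(path):
    """Map the file read-only and search the mapping."""
    with open(path, "rb") as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            mo = PATTERN.search(data)
            return mo.group(1) if mo else None


# Hypothetical test file: many filler lines with one error at the end.
with tempfile.NamedTemporaryFile("wb", delete=False) as tmp:
    tmp.write(b"all good here\n" * 100000)
    tmp.write(b"error: something broke\n")
    bench_path = tmp.name

# Time 5 runs of each approach on the same file.
read_seconds = timeit.timeit(lambda: search_with_read(bench_path), number=5)
mmap_seconds = timeit.timeit(lambda: search_with_mmap(bench_path), number=5)
read_result = search_with_read(bench_path)
mmap_result = search_with_mmap(bench_path)
os.remove(bench_path)

print("read():", read_seconds, "seconds for 5 runs")
print("mmap:  ", mmap_seconds, "seconds for 5 runs")
```

To get numbers that mean anything, run it against your own files and patterns; the small generated file here mostly measures overhead.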

Other Resources

A good summary of the differences
PyMOTW
A good SO question
Wikipedia Virtual Memory article

skrrgwasme