4

Let's say I have a really large file foo.txt and I want to iterate through it, doing something each time a regular expression matches. Currently I do this:

import re

f = open('foo.txt')
s = f.read()
f.close()
for m in re.finditer(regex, s):
    doSomething()

Is there a way to do this without having to store the entire file in memory?

NOTE: Reading the file line by line is not an option because the regex can possibly span multiple lines.

UPDATE: I would also like this to work with stdin if possible.

UPDATE: I am considering somehow emulating a string object with a custom file wrapper but I am not sure if the regex functions would accept a custom string-like object.

Matt
  • FWIW a trivial test I just did seems to indicate that `re.finditer()` _will_ accept a subclass of `str`. Therefore, it sounds possible to use that approach if you can determine and emulate all the `str` methods that `finditer()` uses (see the quick check after these comments). – martineau Jun 21 '12 at 14:46
  • Another possibility would be to subclass the [`UserString`](http://docs.python.org/library/userdict.html#module-UserString) class wrapper (or use it as template for your own class). Source for it should be in the ../lib/UserString.py file of your Python installation. Everything appears to revolve around manipulating `self.data` attribute internally, so maybe you could hook into that in your own subclass. If nothing else it looks like a good implementation guide for a custom `str` subclass. – martineau Jun 21 '12 at 21:22
  • @martineau I think the whole purpose of `UserString` was to subclass `str` back when builtins could not be subclassed. – Matt Jun 21 '12 at 21:34
  • That's probably true, but it still exists and if nothing else it looks useful as a compact guide to the methods needed...to perhaps implement something based on one of the answers below to retrieve the data. – martineau Jun 22 '12 at 00:03
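
As a quick check of martineau's observation in the comments above, here is a minimal sketch (the subclass name and test string are arbitrary). Note that a plain `str` subclass still keeps the whole text in memory, so this only shows that `re.finditer()` does not reject such objects:

import re

class LazyString(str):
    # Placeholder subclass: re.finditer() accepts it, but a plain str
    # subclass still holds the entire text in memory.
    pass

s = LazyString("spam 42 eggs 7")
print([m.group() for m in re.finditer(r"\d+", s)])  # prints ['42', '7']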

4 Answers

5

Either you will have to read the file chunk-wise, with overlaps to allow for the maximum possible length of the expression, or use an mmapped file, which will work almost as well as operating on an in-memory string: https://docs.python.org/library/mmap.html
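
For the mmap route, here is a minimal sketch (the file name and pattern are placeholders). Note that an mmap exposes bytes, so the pattern must be a bytes pattern:

import mmap
import re

# Placeholder pattern; mmap exposes bytes, so the pattern must be bytes.
pattern = re.compile(rb"foo.*?bar", re.DOTALL)

with open("foo.txt", "rb") as f:
    # Map the whole file (length=0); the OS pages data in on demand,
    # so the file is never copied into memory all at once.
    with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as mm:
        for m in pattern.finditer(mm):
            print(m.start(), m.group())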

UPDATE to your UPDATE: Consider that stdin isn't a file; it just behaves a lot like one in that it has a file descriptor and so on. It is a POSIX stream. If you are unclear on the difference, do some googling around. The OS cannot mmap it, therefore Python cannot.

Also consider that what you're doing may be an ill-suited thing to use a regex for. Regexes are great for capturing small stuff, like parsing a connection string, a log entry, CSV data and so on. They are not a good tool for parsing through huge chunks of data; this is by design. You may be better off writing a custom parser.

Some words of wisdom from the past: http://regex.info/blog/2006-09-15/247

Ben Kunz
  • Thanks, unfortunately this does not work with special files like `stdin`. Is there a way that also works with `stdin`? – Matt Jun 20 '12 at 20:52
  • Short answer: no, you cannot mmap stdin. Long answer: consider how regular expressions work, with backtracking and all that seeking around that they do. Now consider doing this on a stream. Does it still sound like a good idea? =D Similarly, mmapping stdin would IMHO not make much sense and would open a whole can of problems. – Ben Kunz Jun 20 '12 at 21:08
  • I don't think regular expressions ever need to backtrack, unless they are not regular. I know you can't mmap stdin or any stream, that's why I asked if there is another way. – Matt Jun 26 '12 at 21:48
  • Also, see this related question for an example of implementation using `mmap`: [How do I re.search or re.match on a whole file without reading it all into memory?](https://stackoverflow.com/a/454589/2291710) – Delgan May 07 '18 at 20:45
5

If you can limit the number of lines that the regex can span to some reasonable number, then you can use a collections.deque to create a rolling window on the file and keep only that number of lines in memory.

from collections import deque

def textwindow(filename, numlines):
    with open(filename) as f:
        # Pre-fill the window; maxlen makes the deque drop the oldest
        # line automatically whenever a new one is appended.
        window = deque((f.readline() for i in range(numlines)), maxlen=numlines)
        nextline = True
        while nextline:
            text = "".join(window)
            yield text
            nextline = f.readline()  # returns "" at EOF, ending the loop
            window.append(nextline)

for text in textwindow("bigfile.txt", 10):
    # test to see whether your regex matches and do something
    pass
kindall
  • I can't really limit the number of lines since the regexes may contain things like `'\n*'` for example. – Matt Jun 20 '12 at 21:05
  • In that case, yeahp, you're gonna have to read the whole file. – kindall Jun 20 '12 at 21:10
  • @Matt: Surely even if something like `'\n*'` is allowed, there's got to be some kind of practical/realistic/real-world limit that could be put into place... for example, are `n` newlines really significantly different from `n+1` of them for large values of `n`? – martineau Jun 20 '12 at 23:59
  • @martineau Of course I could choose an arbitrarily large limit, but this would still use a lot of memory even for small matches, which is exactly what I am trying to avoid. I want to find a solution that only stores what the `re` module needs at that moment. I was hoping there would be some way to pass an iterator to the `re.finditer` function but there does not appear to be a way to do that. It also seems that `re` does not follow the duck typing pattern and checks specifically if the string is a `str`, otherwise I could emulate a string with my own class. – Matt Jun 21 '12 at 14:10
  • Unfortunately, the regular expression engine (like all regular expression engines) is optimized for performance, so it's not going to call your iterator each time it needs a new line. Even if it did, and you wrote `\n*`, you could still end up with the whole file in memory (since the regex might need to backtrack). – kindall Jun 21 '12 at 14:41
  • @kindall When does a regex need to backtrack? – Matt Jun 26 '12 at 21:50
  • Any match with a greedy quantifier might need to backtrack. Sometimes this need can be optimized away, sometimes not. Consider how `(ab)*c` would match against `abababababababd`, for example. – kindall Jun 26 '12 at 22:12
  • I don't see how your example needs to backtrack. This one however may look like it needs to: `(ab)*[abc][abc][abc]` matched against `ababababc`. But if the matching is implemented using NFAs, then no backtracking is necessary. – Matt Jun 28 '12 at 19:45
0

Perhaps you could write a function that reads and yields one line of the file at a time, and call `re.finditer` on each line until the file is exhausted.
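
A minimal sketch of that idea (the helper name, file name, and pattern are placeholders); as the comments below point out, it will miss any match that spans a line break:

import re

def finditer_by_line(fileobj, regex):
    # Hypothetical helper: applies the regex to one line at a time, so
    # memory use stays small, but matches crossing line breaks are lost.
    for line in fileobj:  # iterating a file object yields one line at a time
        for m in regex.finditer(line):
            yield m

with open("foo.txt") as f:
    for m in finditer_by_line(f, re.compile(r"\d+")):
        print(m.group())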

zallarak
  • What if the regex spans multiple lines? – Matt Jun 20 '12 at 20:38
  • Perhaps you could modify the yield function to yield 2 or 3 (or any number that seems reasonable) lines at a time, and for every iteration, push one new line onto it and drop the oldest line to control for that. – zallarak Jun 20 '12 at 21:05
0

Here is another solution, using an internal text buffer to progressively yield found matches without loading the entire file into memory.

This buffer acts like a "sliding window" through the file text, moving forward while yielding found matches.

Because the file content is loaded in chunks rather than line by line, this solution works with multiline regexes too.

def find_chunked(fileobj, regex, *, chunk_size=4096):
    buffer = ""

    while True:
        text = fileobj.read(chunk_size)
        buffer += text
        matches = list(regex.finditer(buffer))

        # End of file: search through the remaining final buffer and exit
        if not text:
            yield from matches
            break

        # Yield found matches except the last one, which may be
        # incomplete because of the chunk cut (think about '.*')
        if len(matches) > 1:
            end = matches[-2].end()
            buffer = buffer[end:]
            yield from matches[:-1]

However, note that it may end up loading the whole file into memory if no matches are found at all, so you should only use this function if you are confident that your file contains the regex pattern many times.
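
A hypothetical usage, with a placeholder pattern and file name; note that the match offsets are relative to the internal buffer, not to positions in the file. Since the function only needs `fileobj.read()`, it also works on `sys.stdin`, which covers the question's update:

import re
import sys

pattern = re.compile(r"foo.*?bar", re.DOTALL)

with open("foo.txt") as f:
    for m in find_chunked(f, pattern):
        # m.start()/m.end() are offsets into the internal buffer,
        # not absolute positions in the file.
        print(m.group())

# The function only calls fileobj.read(), so it works on stdin too:
# for m in find_chunked(sys.stdin, pattern):
#     print(m.group())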

Delgan