How do I re.search or re.match on a whole file without reading it all into memory?

Question

I want to be able to run a regular expression on an entire file, but I'd like to be able to not have to read the whole file into memory at once as I may be working with rather large files in the future. Is there a way to do this? Thanks!

Clarification: I cannot read the file line-by-line because a match can span multiple lines.

Does your regex have to cross line boundaries? If not, you can just match it line-by-line. — A. Rex, Jan 18 '09 at 01:29

score 79 · Accepted Answer · answered Jan 18 '09 at 03:24

You can use mmap to map the file to memory. The file contents can then be accessed like a normal string:

import re, mmap

with open('/var/log/error.log', 'r+') as f:
  data = mmap.mmap(f.fileno(), 0)
  mo = re.search('error: (.*)', data)
  if mo:
    print "found error", mo.group(1)

This also works for big files: the file content is loaded from disk internally as needed.
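For Python 3, where a mapped file behaves like bytes rather than str, a minimal sketch of the same idea could look like the following. The helper name find_errors is hypothetical, and the pattern has to be a bytes pattern. It uses re.finditer so that every match is reported, not just the first one.

```python
import mmap
import re

def find_errors(path, pattern=rb'error: (.*)'):
    """Return group(1) of every match of a bytes pattern in the file at path."""
    with open(path, 'rb') as f:
        # Length 0 maps the whole file; ACCESS_READ keeps the mapping read-only.
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            # The list is built before the mapping is closed, so it only
            # holds plain bytes objects, not references into the mmap.
            return [mo.group(1) for mo in re.finditer(pattern, data)]
```

Note that mmap refuses to map an empty file, so an empty log would need to be handled separately.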

sth

This is perfect. Thank you very much, sth. – Evan Fosmark Jan 18 '09 at 03:34
Just a side note: if you work on a 32-bit system *and* your files could be over 1 GiB, then this method might not work. – tzot Jan 19 '09 at 01:06
The mapped files count to the "used memory" and on 32bit systems one process might only use up to 4GB, so yes, if the file gets up to something like 3GB you could start running into problems. Then it's time to switch to a 64bit processor :). – sth Jan 19 '09 at 05:49
This causes problems in Python 3 because of str/bytes mismatch (`TypeError: can't use a string pattern on a bytes-like object`), so your regex needs to be binary (eew) – Nick T Feb 10 '15 at 04:08
The pattern can be bytes-like... so just use `b'pattern'` – 3mpty Sep 14 '17 at 21:02
And if you have backslashes in your pattern, `br'\*pattern\*'` is entirely sufficient and concise. – bretmattingly Nov 01 '17 at 21:55
`expected type 'T', got mmap instead`? – insidesin Dec 05 '17 at 03:16
This only matches once for me if there are matches on all lines... – ikwyl6 Feb 07 '20 at 04:31
what's the benefit of using mmap? – Jwan622 Jun 24 '21 at 15:17

score 5 · Answer 2 · answered Jan 18 '09 at 01:42

This depends on the file and the regex. The best thing you could do is read the file in line by line, but if that does not work for your situation then you might get stuck with pulling the whole file into memory.

Let's say, for example, that this is your file:

Lorem ipsum dolor sit amet, consectetur
adipiscing elit. Ut fringilla pede blandit
eros sagittis viverra. Curabitur facilisis
urna ABC elementum lacus molestie aliquet.
Vestibulum lobortis semper risus. Etiam
sollicitudin. Vivamus posuere mauris eu
nulla. Nunc nisi. Curabitur fringilla fringilla
elit. Nullam feugiat, metus et suscipit
fermentum, mauris ipsum blandit purus,
non vehicula purus felis sit amet tortor.
Vestibulum odio. Mauris dapibus ultricies
metus. Cras XYZ eu lectus. Cras elit turpis,
ultrices nec, commodo eu, sodales non, erat.
Quisque accumsan, nunc nec porttitor vulputate,
erat dolor suscipit quam, a tristique justo
turpis at erat.

And this was your regex:

consectetur(?=\sadipiscing)

Now this regex uses positive lookahead and will only match the string "consectetur" if it is immediately followed by a whitespace character and then the string "adipiscing".

So in this example you would have to read the whole file into memory because your regex is depending on the entire file being parsed as a single string. This is one of many examples that would require you to have your entire string in memory for a particular regex to work.
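A small sketch of the point above, using only the first two lines of the sample text: matched line by line the lookahead never sees the following line, while on the joined text it succeeds because \s matches the newline.

```python
import re

text = "Lorem ipsum dolor sit amet, consectetur\nadipiscing elit."
pattern = re.compile(r'consectetur(?=\sadipiscing)')

# Line by line: each line ends before "adipiscing" begins, so nothing matches.
line_hits = [bool(pattern.search(line)) for line in text.splitlines(keepends=True)]

# Whole text: "\s" consumes the newline and the lookahead finds "adipiscing".
whole_hit = pattern.search(text)
```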

I guess the unfortunate answer is that it all depends on your situation.

score 3 · Answer 3 · answered Jan 18 '09 at 02:39

If this is a big deal and worth some effort, you can convert the regular expression into a finite state machine (FSM) that reads the file. The FSM can run in O(n) time, where n is the file size, so it stays fast as the file gets big.

You will be able to efficiently match patterns that span lines in files too large to fit in memory.
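To give a feel for the idea, here is a minimal hand-written sketch, not a general regex-to-FSM converter. It hard-codes a three-state machine for the example regex ab+c (chosen only for illustration) and keeps its state between chunks, so a match may straddle any number of reads. The class name StreamingMatcher is hypothetical.

```python
class StreamingMatcher:
    """Hand-built state machine that searches a character stream for ab+c."""

    def __init__(self):
        self.state = 0   # 0: nothing seen, 1: seen "a", 2: seen "ab+"
        self.offset = 0  # number of characters consumed so far

    def feed(self, chunk):
        """Consume one chunk and return the end offsets of matches completed in it."""
        ends = []
        for ch in chunk:
            self.offset += 1
            if self.state == 2 and ch == 'c':
                ends.append(self.offset)  # a full "ab+c" just ended here
                self.state = 0
            elif self.state in (1, 2) and ch == 'b':
                self.state = 2
            elif ch == 'a':
                self.state = 1            # a fresh "a" always restarts a candidate
            else:
                self.state = 0
        return ends
```

Each character is looked at exactly once, and only the current state and offset are kept in memory. To run it over a file you could feed it chunks, for example with `for chunk in iter(lambda: f.read(65536), '')`.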

Here are two places that describe the algorithm for converting a regular expression to a FSM:

Mark Harrison

Some features of re module require arbitrary backtracking during search. I think these fall outside the 'formal' re definition. But indeed a significant portion of useful cases can be done as an FSM. There's a case to be made that this could be an alternate re API: feed strings in, get matches out, even if they span the input strings. Useful even if the 'matches' don't include all the matched text (simplifying 'FSM state'). Another missing feature: a way to read bytes from a file into an existing array.array('c'), so you don't need to keep allocating and freeing memory on each read. – greggo Aug 20 '14 at 17:49

score 2 · Answer 4 · edited Aug 09 '10 at 22:41

This is one way:

import re

REGEX = r'\d+'

with open('/tmp/workfile', 'r') as f:
    for line in f:
        # re.match only matches at the start of each line
        print(re.match(REGEX, line))

The with statement (available since Python 2.5) takes care of closing the file automatically, so you need not worry about it.
Iterating over the file object is memory efficient: it won't hold more than one line in memory at a time.
The drawback of this approach is that it can take a lot of time for huge files.

Another approach that comes to mind is to use read(size), optionally together with file.seek(offset), to read a portion of the file at a time.

import os
import re

REGEX = r'\d+'

with open('/tmp/workfile', 'r') as f:
    filesize = os.fstat(f.fileno()).st_size
    part = max(1, filesize // 10)  # a suitable size that you can determine ahead of time or in the program
    while True:
        content = f.read(part)  # read() advances the file position, so no seek() is needed here
        if not content:
            break
        # Note: a match that straddles two chunks will be missed
        print(re.match(REGEX, content))

You can also combine the two: create a generator that returns the contents a certain number of bytes at a time, and iterate over those chunks to check your regex. IMO this would be a good approach.
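A rough sketch of such a generator follows. The name iter_matches and its parameters are hypothetical, and it rests on a loud assumption: no single match is longer than overlap bytes. Matches that start in the last overlap bytes of the buffer are deferred until more data has been read, so a match cut in half by a chunk boundary is still found. Lookbehinds and anchors such as ^ may behave differently at the start of a buffer.

```python
import re

def iter_matches(path, pattern, chunk_size=1 << 20, overlap=4096):
    """Yield (absolute_offset, matched_bytes) for a bytes pattern, reading in chunks.

    Assumes no match is longer than `overlap` bytes.
    """
    regex = re.compile(pattern)
    buf = b''
    base = 0  # absolute file offset of buf[0]
    with open(path, 'rb') as f:
        while True:
            chunk = f.read(chunk_size)
            at_eof = not chunk
            buf += chunk
            # Matches starting at or after `limit` might be truncated: defer them.
            limit = len(buf) if at_eof else len(buf) - overlap
            consumed = 0
            for mo in regex.finditer(buf):
                if mo.start() >= limit:
                    break
                yield base + mo.start(), mo.group()
                consumed = mo.end()
            if at_eof:
                return
            # Drop everything already handled, keep the tail for the next round.
            keep_from = max(consumed, limit)
            buf = buf[keep_from:]
            base += keep_from
```

Memory use stays around chunk_size + overlap bytes, regardless of the file size.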

score 1 · Answer 5 · edited Feb 15 '19 at 05:27

Here's an option using re and mmap to count all the words in a file without building lists or loading the whole file into memory.

import re
from contextlib import closing
from mmap import mmap, ACCESS_READ

with open('filepath.txt', 'rb') as f:
    with closing(mmap(f.fileno(), 0, access=ACCESS_READ)) as d:
        # finditer is lazy, so the matches are counted one at a time
        print(sum(1 for _ in re.finditer(rb'\w+', d)))

Based on @sth's answer, but with less memory usage.
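If you also want the line each match sits on, one possible sketch is to look for the neighbouring newlines in the mapped file with mmap's own find and rfind. The function name lines_with_match is hypothetical. If a match spans several lines, all of the lines it touches are returned as one bytes object.

```python
import mmap
import re

def lines_with_match(path, pattern):
    """Return the line(s) around each match of a bytes pattern, as bytes."""
    results = []
    with open(path, 'rb') as f:
        with mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ) as data:
            for mo in re.finditer(pattern, data):
                # Last newline before the match (or -1), plus one: start of the line.
                start = data.rfind(b'\n', 0, mo.start()) + 1
                # First newline after the match, or the end of the file.
                end = data.find(b'\n', mo.end())
                if end == -1:
                    end = len(data)
                results.append(data[start:end])
    return results
```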

answered Feb 15 '19 at 05:02

Jab

Do you know how I can use this with the re.search function so it returns or prints the line that the pattern was found on? – ikwyl6 Feb 07 '20 at 04:42
@ikwyl6 I ended up using this method, but slightly different `re` use [in a project](https://github.com/jgstew/jgstew-recipes/blob/2d4e383aff6e914d290cc98402edf5ffc3473fe7/SharedProcessors/FileTextSearcher.py#L72-L79). – jgstew Nov 10 '21 at 15:53

score 0 · Answer 6 · answered Jan 18 '15 at 13:56

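import re

# filename is the path of the HTML file to scan (defined elsewhere)
ROW = r'(<tr align="right"><td>)([0-9]*)(</td><td>)([a-zA-Z]*)(</td><td>)([a-zA-Z]*)(</td>)'

l = []
with open(filename, 'r') as f:
    for eachline in f:
        string = re.search(ROW, eachline)
        if string:
            # groups 2, 4 and 6 hold the cell contents
            for i in range(2, 8, 2):
                l.append(string.group(i))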

score 0 · Answer 7 · answered Jan 18 '09 at 01:44

For single-line patterns you can iterate over the lines of the file, but for multi-line patterns you will have to read all of the file into memory (or part of it, but that will be hard to keep track of).

sykora

score 0 · Answer 8 · answered Jan 18 '09 at 01:46

Open the file and iterate over the lines.

import re

with open('myfile') as fd:
    for line in fd:
        if re.match(..., line):  # ... stands for your pattern
            print(line)

Mark Harrison

score 0 · Answer 9 · answered Jan 07 '20 at 18:49

Python 3: To load the file as one big string, use the read() and decode() methods.

import re, mmap


def read_search_in_file(file):
    with open(file, 'r+') as f:
        # Note: read() copies the whole mapped file into memory before decoding
        data = mmap.mmap(f.fileno(), 0).read().decode("utf-8")
        error = re.search(r'error: (.*)', data)
    if error:
        return error.group(1)
