4

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

  • actual file size: 2GB
  • occurrences of 'start': ~ 12 M, evenly distributed
  • occurrences of 'end': ~800, near the end of the file.
Eero Aaltonen
  • Well, if you want to match between `start` and `end`, then it's normal that you get `start spam` as the beginning result... Could you clarify the behavior that you want? – lcoderre Jul 08 '14 at 19:35

4 Answers

16

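This regex should match what you want:

start(?:(?!start).)*?end

The inner group is non-capturing so that re.findall returns each full match as a string rather than a tuple of groups. Use the re.findall method with the single-line modifier re.S to get all the occurrences in a multi-line string:

re.findall(r'start(?:(?!start).)*?end', text, re.S)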

See a test here.
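As a rough check, here is a minimal sketch that runs a non-capturing form of the pattern against the sample from the question. The variable names `text`, `pattern`, and `matches` are only illustrative, and the whole input is assumed to fit in memory, which may not hold for a 2GB log.

```python
import re

text = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

# Tempered dot: consume any character (re.S lets "." match newlines) as long
# as a new "start" does not begin at that position, then stop at the first "end".
pattern = re.compile(r"start(?:(?!start).)*?end", re.S)

# With no capturing groups, findall returns the full matches as strings.
matches = pattern.findall(text)
for m in matches:
    print(m)
```

Note that the pattern matches the bare substrings, so an "end" inside a longer word would also close a match.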

famousgarkin
1

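Do it in code with a basic state machine:

in_block = False  # True once a 'start' line has been seen
block = []        # lines of the current candidate block
for ln in fi:     # fi is the open log file
    if 'start' in ln:
        # A newer 'start' replaces any candidate collected so far.
        block = []
        in_block = True

    if in_block:
        block.append(ln)

    if 'end' in ln:
        in_block = False
        # Lines keep their own newline, so suppress print's.
        for x in block:
            print(x, end='')
        block = []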
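The same idea can be sketched as a generator, which makes it easy to test against the sample from the question. The function name `extract_blocks` is hypothetical, and `io.StringIO` stands in for the real log file. Because the buffer is reset at every 'start' line, only the current candidate block is held in memory.

```python
import io

def extract_blocks(lines):
    """Yield each block from the last 'start' line before an 'end' line
    through that 'end' line (whole lines, not exact substrings)."""
    block = []
    in_block = False
    for ln in lines:
        if 'start' in ln:
            # A newer 'start' replaces any candidate collected so far.
            block = []
            in_block = True
        if in_block:
            block.append(ln)
            if 'end' in ln:
                yield ''.join(block)
                block = []
                in_block = False

sample = io.StringIO(
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

blocks = list(extract_blocks(sample))
for b in blocks:
    print(b, end='')
```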
gkusner
0

This is tricky to do because, by default, the re module does not return overlapping matches. The third-party regex module, available from PyPI, does allow for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you're stuck with the standard re module, it's still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
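One common way to get overlapping candidates from the standard re module is a zero-width lookahead wrapped around a capturing group. The sketch below assumes that technique (it is not necessarily the exact code from the linked answer) and keeps the shortest candidate for each "end" position. The names `candidates`, `shortest`, and `result` are illustrative.

```python
import re

text = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

# The lookahead consumes nothing, so the scan resumes one character later
# and reports a candidate at every "start", even when candidates overlap.
candidates = re.finditer(r"(?=(start.*?end))", text, re.S)

# Several candidates can stop at the same "end"; keep the shortest per end offset.
shortest = {}
for m in candidates:
    end_pos = m.end(1)
    best = shortest.get(end_pos)
    if best is None or len(m.group(1)) < len(best):
        shortest[end_pos] = m.group(1)

result = [shortest[k] for k in sorted(shortest)]
for r in result:
    print(r)
```

Each candidate is scanned from its own "start" to the next "end", so this is likely to be slow on input where millions of "start" occurrences precede the first "end".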

TheSoundDefense
  • I added some information on the actual size of the log file. In this case, storing all overlapping matches would exceed the disk space of my computer. – Eero Aaltonen Jul 09 '14 at 12:18
  • Well, the solution I linked to returns an iterator, so you wouldn't actually need to store all overlapping matches, just one or two at a time. But given the format of the file you're trying to parse, the accepted solution is probably better for your purposes. – TheSoundDefense Jul 09 '14 at 14:04
0

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
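A minimal sketch of this match-then-filter idea, run on the sample from the question; the names `pieces` and `matches` are illustrative, and the text is assumed to fit in memory.

```python
import re

text = """start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end
"""

# Each piece runs from a "start" up to whichever comes first: the next
# "start" (not consumed) or the next "end" (consumed by the optional group).
pattern = re.compile(r"(?s)start.*?(?=end|start)(?:end)?")
pieces = pattern.findall(text)

# Pieces cut short by a following "start" do not end in "end"; drop them.
matches = [p for p in pieces if p.endswith("end")]
for m in matches:
    print(m)
```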

David Ehrmann