Find shortest matches between two strings

Question

I have a large log file, and I want to extract a multi-line string between two strings: start and end.

The following is a sample from the input file:

start spam
start rubbish
start wait for it...
    profit!
here end
start garbage
start second match
win. end

The desired solution should print:

start wait for it...
    profit!
here end
start second match
win. end

I tried a simple regex but it returned everything from start spam. How should this be done?

Edit: Additional info on real-life computational complexity:

actual file size: 2GB
occurrences of 'start': ~ 12 M, evenly distributed
occurrences of 'end': ~800, near the end of the file.

Well, if you want to match between `start` and `end`, then it's normal that you get `start spam` as the beginning result... Could you clarify the behavior that you want? — lcoderre, Jul 08 '14 at 19:35

famousgarkin · Accepted Answer · 2014-07-08T20:14:52.097

16

This regex should match what you want:

(start((?!start).)*?end)

Use the re.findall method with the single-line (dot-all) modifier re.S to get all the occurrences in a multi-line string:

re.findall('(start((?!start).)*?end)', text, re.S)

See a test here.
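
As a minimal sketch, here is the pattern above applied to the sample from the question. One adjustment that is not part of the original answer: the groups are written as non-capturing, because `re.findall` returns tuples of groups (rather than whole matches) when the pattern contains capturing groups. The names `SAMPLE` and `PATTERN` are illustrative only.

```python
import re

SAMPLE = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# (?:(?!start).)*? consumes one character at a time, but only while the
# text at that position does not begin another "start".
PATTERN = re.compile(r"start(?:(?!start).)*?end", re.S)

matches = PATTERN.findall(SAMPLE)
for block in matches:
    print(block)
```

Because the body can never run across a later `start`, each match begins at the last `start` before its `end`, which is the shortest candidate.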

edited Jul 08 '14 at 20:14

answered Jul 08 '14 at 19:40

famousgarkin


Why have I never heard of regex101 before...? – RevanProdigalKnight Jul 08 '14 at 20:24
Good answer and demo on regex101. The key that I was missing was the negative lookahead. Really useful. – Eero Aaltonen Jul 09 '14 at 09:25
Working in JS as well. – semanser Aug 11 '17 at 09:33
Could you explain `((?!start).)`? – roschach Jan 27 '19 at 10:32
@FrancescoBoi See [Tempered Greedy Token - What is different about placing the dot before the negative lookahead](https://stackoverflow.com/a/37343088/3832970). – Wiktor Stribiżew Aug 14 '19 at 13:54
In case you start having performance issues using this pattern, use `re.findall(r'start[^se]*(?:s(?!tart)[^se]*|e(?!nd)[^se]*)*end', text)` – Wiktor Stribiżew Aug 28 '19 at 18:33

score 1 · Answer 2 · answered Jul 08 '14 at 19:49

Do it with code - basic state machine:

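# fi is assumed to be an open file object (or any iterable of lines)
in_block = False  # renamed from `open`, which shadows the builtin
tmp = []
for ln in fi:
    if 'start' in ln:
        # a newer 'start' discards whatever was collected so far
        tmp = []
        in_block = True

    if in_block:
        tmp.append(ln)

    if 'end' in ln:
        in_block = False
        for x in tmp:
            print(x, end='')  # lines already end with a newline
        tmp = []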
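
The same line-based idea can be sketched as a generator, which is easier to test and keeps only the current candidate block in memory. The function name `iter_blocks` and its parameters are hypothetical, and the sketch assumes the markers are detected per line, as in the answer above.

```python
def iter_blocks(lines, start="start", end="end"):
    """Yield each shortest start..end block from an iterable of lines."""
    block = []
    in_block = False
    for line in lines:
        if start in line:
            # A newer start marker discards whatever was collected so far.
            block = []
            in_block = True
        if in_block:
            block.append(line)
            if end in line:
                yield "".join(block)
                block = []
                in_block = False


sample_lines = [
    "start spam\n",
    "start rubbish\n",
    "start wait for it...\n",
    "    profit!\n",
    "here end\n",
    "start garbage\n",
    "start second match\n",
    "win. end\n",
]
blocks = list(iter_blocks(sample_lines))
```

Passing an open file object instead of `sample_lines` would stream the log line by line. Note that whole lines are yielded, so any text before `start` or after `end` on the same line is included.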

answered Jul 08 '14 at 19:49

gkusner


Perfectly valid also. – Eero Aaltonen Jul 09 '14 at 10:42

score 0 · Answer 3 · edited May 23 '17 at 12:18

This is tricky to do because by default, the re module does not return overlapping matches. The third-party regex module, available on PyPI, does allow for overlapping matches.

https://pypi.python.org/pypi/regex

You'd want to use something like

regex.findall(pattern, string, overlapped=True)

If you can't install regex and are stuck with the standard re module, it's still possible with some trickery. One brilliant person solved it here:

Python regex find all overlapping matches?

Once you have all possible overlapping (non-greedy, I imagine) matches, just determine which one is shortest, which should be easy.
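
A rough sketch of that idea with the standard `re` module, using a capturing group inside a zero-width lookahead (a common trick for overlapping matches, not necessarily the exact code from the linked answer). The helper name `shortest_matches` is hypothetical. Candidates that share the same end position are collapsed to the one that starts last, which is the shortest.

```python
import re

SAMPLE = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)


def shortest_matches(text):
    best = {}  # end offset -> shortest candidate seen so far
    # The lookahead consumes nothing, so a candidate is tried at every
    # position and overlapping candidates are all reported.
    for m in re.finditer(r"(?=(start.*?end))", text, re.S):
        # Candidates arrive in order of increasing start offset, so a
        # later candidate with the same end offset is always shorter.
        best[m.end(1)] = m.group(1)
    return [best[pos] for pos in sorted(best)]


matches = shortest_matches(SAMPLE)
```

Every `start` scans forward to the next `end`, so with many `start` markers and few `end` markers, as in the question's log file, this approach is likely to be slow.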

edited May 23 '17 at 12:18

Community


answered Jul 08 '14 at 19:38

TheSoundDefense


I added some information on the actual size of the log file. In this case, storing all overlapping matches would exceed the disk space of my computer. – Eero Aaltonen Jul 09 '14 at 12:18
Well, the solution I linked to returns an iterator, so you wouldn't actually need to store all overlapping matches, just one or two at a time. But given the format of the file you're trying to parse, the accepted solution is probably better for your purposes. – TheSoundDefense Jul 09 '14 at 14:04

score 0 · Answer 4 · answered Jul 08 '14 at 19:42

You could do (?s)start.*?(?=end|start)(?:end)?, then filter out everything not ending in "end".
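
A minimal sketch of this two-step approach on the sample from the question. The names `SAMPLE`, `candidates` and `matches` are illustrative only.

```python
import re

SAMPLE = (
    "start spam\n"
    "start rubbish\n"
    "start wait for it...\n"
    "    profit!\n"
    "here end\n"
    "start garbage\n"
    "start second match\n"
    "win. end\n"
)

# Step 1: each candidate runs from a "start" up to whichever comes
# first, the next "start" or the next "end" (which is then included).
candidates = re.findall(r"(?s)start.*?(?=end|start)(?:end)?", SAMPLE)

# Step 2: keep only the candidates that actually reached an "end".
matches = [c for c in candidates if c.endswith("end")]
```

Each character is scanned roughly once, at the cost of producing one throwaway candidate per unmatched `start`.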

answered Jul 08 '14 at 19:42

David Ehrmann

