
I am parsing log files that contain lines describing events for many jobs, each identified by a job id. I am trying to get all lines in a log file between two patterns in Python.

I have read the very useful post "How to select lines between two patterns?" and have already solved the problem with awk like so:

awk '/pattern1/,/pattern2/' file

Since I am processing the log information in a Python script, I am using subprocess.Popen() to execute that awk command. My program works, but I would like to solve this using Python alone.
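
Simplified, that workaround looks roughly like this (a sketch; details such as decompressing the .bz2 file first are left out):

import subprocess

# run awk and capture the selected lines from its stdout
proc = subprocess.Popen(['awk', '/pattern1/,/pattern2/', '/some/log/file'],
                        stdout=subprocess.PIPE)
selected_lines, _ = proc.communicate()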

I know of the re module, but don't quite understand how to use it. The log files have already been compressed to bz2, so this is my code to open the .bz2 files and find the lines between the two patterns:

import bz2
import re

logfile = '/some/log/file.bz2'

PATTERN = r"/{0}/,/{1}/".format('pattern1', 'pattern2')
# example: PATTERN = r"/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/"
re.compile(PATTERN)

with bz2.BZ2File(logfile) as fh:
    match = re.findall(PATTERN, fh.read())

However, match is empty (fh.read() is not!). Using re.findall(PATTERN, fh.read(), re.MULTILINE) has no effect. Compiling with the re.DEBUG flag shows many lines with

literal 47
literal 50
literal 48
literal 49
literal 57

and two say

any None

I could solve the problem with loops, as in "python print between two patterns, including lines containing patterns", but I avoid nested for-if loops as much as I can. I believe the re module can yield the result I want, but I am no expert in using it.

I am using Python 2.7.9.

Bubastis
  • your `/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/` does not contain newlines, while to use `awk` you *must* have them, so what is a *realistic* logfile content? – Walter Tross Jan 03 '19 at 13:26
  • That is the regular expression, not the log file. For awk, I would use `awk '/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/' logfile` and the result is the lines between those two patterns contained in the logfile. I could post the log, but I'd better anonymise some strings, because it contains customer-related information. – Bubastis Jan 03 '19 at 13:45
  • ok, it's clearer now – Walter Tross Jan 03 '19 at 13:49

2 Answers


It's usually a bad idea to read a whole log file into memory, so I'll give you a line-by-line solution. I'll assume that the dots you have in your example are the only varying part of the pattern. I'll also assume that you want to collect line groups in a list of lists.

import bz2
import re

with_delimiting_lines = True
logfile = '/some/log/file.bz2'
# the slashes in the awk command are delimiters, not part of the patterns
group_start_regex = re.compile(r'0001.server;Considering job to run')
group_stop_regex  = re.compile(r'0040;pbs_sched;Job;0001.server')
group_list = []
with bz2.BZ2File(logfile) if logfile.endswith('.bz2') else open(logfile) as fh:
    inside_group = False
    for line_with_nl in fh:
        line = line_with_nl.rstrip()
        if inside_group:
            if group_stop_regex.match(line):
                inside_group = False
                if with_delimiting_lines:
                    group.append(line)
                group_list.append(group)
            else:
                group.append(line)
        elif group_start_regex.match(line):
            inside_group = True
            group = []
            if with_delimiting_lines:
                group.append(line)

Please note that match() matches only at the beginning of the line (as if the pattern started with ^ and re.MULTILINE were off).
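
For illustration, a standalone snippet with a made-up log line:

import re

line = '0040;pbs_sched;Job;0001.server'
print(re.match(r'pbs_sched', line))   # None: the line does not start with the pattern
print(re.search(r'pbs_sched', line))  # a match object: the pattern occurs mid-line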

Walter Tross

/pattern1/,/pattern2/ isn't a regex; it's a construct specific to awk, composed of two regexes.

With pure regex you could use pattern1.*?pattern2 with the DOTALL flag (which makes . match newlines when it usually wouldn't):

re.findall("pattern1.*?pattern2", input, re.DOTALL)

It differs from the awk command, which matches the full lines containing the start and end patterns; that behaviour can be achieved as follows:

re.findall("[^\n]*pattern1.*?pattern2[^\n]*", input, re.DOTALL)

Try it here!
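
For reference, a minimal self-contained run of that regex (the log content below is made up, not your real log):

import re

log = ('boot\n'
       'x;pattern1;block one starts\n'
       'doing work\n'
       'y;pattern2;block one ends\n'
       'unrelated noise\n'
       'x;pattern1;block two starts\n'
       'more work\n'
       'y;pattern2;block two ends\n')

# each element of blocks is one group of lines, including the delimiting lines
blocks = re.findall("[^\n]*pattern1.*?pattern2[^\n]*", log, re.DOTALL)
for block in blocks:
    print(block)
    print('---')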

Note that I answered your question as it was asked for the sake of pedagogy, but Walter Tross' solution should be preferred.

Aaron
  • Thank you Aaron. This is almost exactly what I asked for. However, pattern1 and pattern2 occur many times in the log. The awk command matches all lines between those patterns. I tried your solution in ideone with an input containing several such blocks, plus non-relevant blocks, but it matches everything from the first pattern1 onward, including the lines after pattern2. – Bubastis Jan 03 '19 at 14:30
  • @Bubastis right, I should have seen it coming. That can be fixed by using the reluctant version of the `*` quantifier. I've updated my answer and the ideone test :) – Aaron Jan 03 '19 at 14:42
  • I am unsure which answer to select as the correct one. Walter's solution should be preferred, but the answer closer to the question is yours. – Bubastis Jan 09 '19 at 08:43
  • @Bubastis I think Walter's should be accepted :) Not sure you can accept any solution since a duplicate has been linked however – Aaron Jan 09 '19 at 11:44