I am parsing log files whose lines record events from many jobs, each identified by a job ID. I am trying to get all lines in a log file between two patterns in Python.
I have read the very useful post "How to select lines between two patterns?" and have already solved the problem with awk like so:
awk '/pattern1/,/pattern2/' file
Since I am processing the log information in a Python script, I am using subprocess.Popen() to execute that awk command. My program works, but I would like to solve this using Python alone.
I know of the re module, but don't quite understand how to use it. The log files have already been compressed to bz2, so this is my code to open the .bz2 files and find the lines between the two patterns:
import bz2
import re
logfile = '/some/log/file.bz2'
PATTERN = r"/{0}/,/{1}/".format('pattern1', 'pattern2')
# example: PATTERN = r"/0001.server;Considering job to run/,/0040;pbs_sched;Job;0001.server/"
re.compile(PATTERN)
with bz2.BZ2File(logfile) as fh:
    match = re.findall(PATTERN, fh.read())
However, match is empty (fh.read() is not!). Using re.findall(PATTERN, fh.read(), re.MULTILINE) has no effect.
Using re.DEBUG with re.compile() shows many lines like
literal 47
literal 50
literal 48
literal 49
literal 57
and two say
any None
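To make sense of that output, here is a minimal sketch that decodes the literal codes shown above and checks what the awk-style pattern can actually match. The sample text and the variable names are made up for illustration. It suggests that the slashes and the comma are treated as ordinary characters rather than as range delimiters, and that the two "any" entries correspond to the two unescaped dots in the example pattern.

```python
import re

# Codes copied from the re.DEBUG output; chr() maps each one back to its character.
codes = [47, 50, 48, 49, 57]
decoded = "".join(chr(c) for c in codes)
print(decoded)  # /2019

# Hypothetical sample text standing in for the decompressed log.
sample = "noise\npattern1 start\nmiddle\npattern2 end\nnoise\n"

# Same construction as in the question: awk range syntax used as a regex.
awk_style = r"/{0}/,/{1}/".format("pattern1", "pattern2")

# The slashes and the comma are literal, so ordinary log text never matches.
no_match = re.findall(awk_style, sample)
print(no_match)  # []

# The pattern only matches text that literally contains the awk expression.
literal_match = re.findall(awk_style, "/pattern1/,/pattern2/")
print(literal_match)  # ['/pattern1/,/pattern2/']
```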
I could solve the problem with loops, as in "python print between two patterns, including lines containing patterns", but I avoid nested for-if loops as much as I can. I believe the re module can yield the result I want, but I am no expert in how to use it.
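A rough sketch of the kind of re-based solution I have in mind, assuming the whole decompressed file fits in memory. The sample text, the names start, end and block_re, and the pattern construction are all assumptions for illustration and have not been checked against the real logs.

```python
import re

# Hypothetical sample text standing in for the decompressed log contents.
text = (
    "unrelated line\n"
    "0001.server;Considering job to run\n"
    "some event for the job\n"
    "0040;pbs_sched;Job;0001.server\n"
    "another unrelated line\n"
)

# Escape the markers so that '.' and other metacharacters match literally.
start = re.escape("0001.server;Considering job to run")
end = re.escape("0040;pbs_sched;Job;0001.server")

# MULTILINE lets ^ and $ anchor at line boundaries; DOTALL lets .*? cross
# newlines. The non-greedy .*? stops at the first occurrence of the end marker.
block_re = re.compile(
    r"^[^\n]*{0}.*?{1}[^\n]*$".format(start, end),
    re.MULTILINE | re.DOTALL,
)

blocks = block_re.findall(text)
for block in blocks:
    print(block)
```

Unlike the awk range, this sketch finds nothing when the end marker never appears, whereas awk would print everything from the start marker to the end of the file.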
I am using Python 2.7.9.