I'm working on a parser for a file format that is divided into sections, each consisting of a header keyword followed by a block of heterogeneous data. Sections are always separated by blank lines. Something along the lines of the following:
    Header_A
    1 1.02345
    2 2.97959
    ...

    Header_B
    1 5.1700 10.2500
    2 5.0660 10.5000
    ...
Each section contains very different types of data, and depending on certain keywords within a block, the data must be stored in different locations. My general approach is to have a regex that matches every keyword that can start a section and then iterate through the lines of the file. When I find a match, I pop lines until I reach a blank line, storing the data from each line in the appropriate location.
This is the basic structure of the code, where "do stuff with current_line" stands for a bunch of branches that depend on what the line contains:
    import re

    headers = re.compile(r"""
        ((?P<header_a>Header_A)
        |
        (?P<header_b>Header_B))
        """, re.VERBOSE)

    # data_lines is the list of lines read from the file
    i = 0
    while i < len(data_lines):
        match = headers.match(data_lines[i])
        if match:
            if match.group('header_a'):
                # drop the header line and the line after it
                data_lines.pop(i)
                data_lines.pop(i)
                # not end of file, not blank line
                while i < len(data_lines) and data_lines[i].strip():
                    current_line = data_lines.pop(i)
                    # do stuff with current_line
            elif match.group('header_b'):
                # drop the header line and the line after it
                data_lines.pop(i)
                data_lines.pop(i)
                # not end of file, not blank line
                while i < len(data_lines) and data_lines[i].strip():
                    current_line = data_lines.pop(i)
                    # do stuff with current_line
            else:
                i += 1
        else:
            i += 1
Everything works correctly, but it amounts to a deeply nested structure that I find hard to read and that is likely hard to follow for anyone unfamiliar with the code. It also makes it harder to keep lines under 79 characters and, more generally, doesn't feel very Pythonic.
One thing I'm working on is moving the branch for each header into its own function. This will hopefully improve readability quite a bit, but...
...is there a cleaner way to perform the outer looping/matching structure? Maybe using itertools?
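To make the per-header function idea concrete, here is a rough sketch of one way it might look. The names `parse_header_a`, `parse_header_b`, `HANDLERS`, `parse`, and the `results` dict are all placeholders, and the parsing inside each function is only a stand-in for the real per-header logic. Unlike the code above, it slices each block out instead of popping lines, does not skip the line after the header, and drops the outer regex group so that `match.lastgroup` returns the name of the header that matched:

```python
import re

HEADERS = re.compile(r"(?P<header_a>Header_A)|(?P<header_b>Header_B)")


def parse_header_a(block, results):
    # Placeholder logic: assume each line is "index value".
    for line in block:
        index, value = line.split()
        results.setdefault('a', []).append((int(index), float(value)))


def parse_header_b(block, results):
    # Placeholder logic: assume each line is "index x y".
    for line in block:
        index, x, y = line.split()
        results.setdefault('b', []).append((int(index), float(x), float(y)))


# Maps each named group in HEADERS to the function that handles its block.
HANDLERS = {'header_a': parse_header_a, 'header_b': parse_header_b}


def parse(data_lines):
    results = {}
    i = 0
    while i < len(data_lines):
        match = HEADERS.match(data_lines[i])
        if not match:
            i += 1
            continue
        # Collect the lines after the header, up to the next blank line.
        start = i + 1
        end = start
        while end < len(data_lines) and data_lines[end].strip():
            end += 1
        HANDLERS[match.lastgroup](data_lines[start:end], results)
        i = end
    return results
```

With this shape the outer loop no longer grows when a header type is added; only `HEADERS` and `HANDLERS` change.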
Also, for various reasons, this code must be able to run on Python 2.7.
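On the itertools idea, the kind of thing I have in mind is `itertools.groupby` keyed on whether a line is blank, so that each run of non-blank lines becomes one block. This is only a sketch: `is_not_blank` and `iter_blocks` are made-up names, and it assumes the header is always the first line of its block. It uses nothing beyond what Python 2.7 provides:

```python
from itertools import groupby


def is_not_blank(line):
    return bool(line.strip())


def iter_blocks(data_lines):
    """Yield (header, body_lines) for each run of non-blank lines."""
    # groupby splits the input into alternating runs of blank and
    # non-blank lines; only the non-blank runs are kept.
    for has_text, group in groupby(data_lines, key=is_not_blank):
        if has_text:
            block = [line.strip() for line in group]
            yield block[0], block[1:]
```

Each `(header, body_lines)` pair could then be matched against the header regex and handed to the function for that header.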