UPDATE: Here is a re-engineered version of the original idea using a class and a somewhat different (and I hope cleaner) approach to parsing the data.
Assumptions:
- The data starts with a line like 'SITE_COUNT: nnn' where nnn is the number of site sections.
- Each site section is preceded by a line starting with '{' and is followed by a line starting with '}'.
- Header and site data are stored in lists. It should be easy to change the code to store and return the data in other data types, such as dicts.
The call Parse(f) in the code below creates a Parse object, which parses the entire stream f and exposes methods to obtain the header and the (multiple) site data captured (see the last few lines of the code).
For the sample input data below, the output is:
Site count is 2
Processing site 0
Processing site 1
Finished processing.
The headers are ['PRB_CARD: 108', 'PRB_FACILITY: LEHI']
Site 0 data is ['1000:1665.67', '1007:12.0638', '1011:27.728']
Site 1 data is ['1003:10.0677']
Code:
import re


class Parse:
    """Parse a log: a SITE_COUNT line, PRB_ headers, then site sections."""

    def __init__(self, f):
        self.f = f
        self.line = ''
        # Per-instance lists, so separate Parse objects never share data.
        self.headers = []
        self.site_data = []

        # Capture the number of sites from the first line of the file.
        self.readline()
        m = re.match(r'SITE_COUNT: (\d+)', self.line)
        if m:
            self.site_count = int(m.group(1))
            print('Site count is', self.site_count)
        else:
            raise ValueError('Invalid input file format')

        self.headers = self.capture_headers()
        for i in range(self.site_count):
            print('Processing site', i)
            self.site_data.append(self.capture_site())
        print('Finished processing.')

    def capture_headers(self):
        """Collect the 'PRB_' lines that precede the first '{'."""
        headers = []
        while self.readline():
            if self.line.startswith('{'):
                break
            if self.line.startswith('PRB_'):
                headers.append(self.line)
        return headers

    def capture_site(self):
        """Collect 'id:value' from each 'id:value:VALID' line up to the next '}'."""
        pat = re.compile(r'(\d+:\d*\.\d*):VALID')
        data = []
        while self.readline():
            if self.line.startswith('}'):
                break
            m = pat.match(self.line)
            if m:
                data.append(m.group(1))
        return data

    def get_headers(self):
        return self.headers

    def get_site_count(self):
        return self.site_count

    def get_site_data(self, i):
        return self.site_data[i]

    def readline(self):
        """Read the next line into self.line; return False at the end of the data."""
        raw = self.f.readline()  # read from the stream passed in, not a global f
        self.line = raw.rstrip('\n\r')
        # Stop at end of file (empty read) or at the FINISH_ETIME trailer.
        return bool(raw) and not self.line.startswith('FINISH_ETIME:')


# Run against the (slightly modified) data:
with open('data.log') as f:
    p = Parse(f)

print('The headers are', p.get_headers())
for i in range(p.get_site_count()):
    print('Site', i, 'data is', p.get_site_data(i))
And the input is: (note the site count in the first line!)
SITE_COUNT: 2
PRB_CARD: 108
PRB_FACILITY: LEHI
PROCESS_ID: 88AFX
TEMP: 0
DATA_SET_ID: _P9kTbjdptOyonKO_
START_DATETIME: 05/01/2020 03:06:24
LOT: 0522072.0
.
.
.
+ 1 1588323984 1NA:0NN
{
Head(1) Site(0) (X,Y)=(-4,16)
VALID_COUNT 712
*Z
SITE:PARAM_SITE:N3P3
SITE:PROCESS_DURATION:81
1000:1665.67:VALID
.
.
1007:12.0638:VALID
1011:27.728:VALID
.
.
NUM_REGISTERS 712
NUM_VALID 6787
NUM_TESTED 6787
}
.
.
.
+ 2 1585959359 1NA:0NN
{
Head(1) Site(1) (X,Y)=(-2,4)
VALID_COUNT 583
*Z
SITE:PARAM_SITE:N2N3
SITE:PROCESS_DURATION:286
1003:10.0677:VALID
.
.
.
}
FINISH_ETIME: 1588324881
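
If dicts are more convenient than lists, as mentioned in the assumptions, the captured strings can be converted after parsing. Here is a minimal sketch of one way to do it. The helper names headers_to_dict and site_to_dict are hypothetical and not part of the class above, and the sketch assumes every header looks like 'NAME: value' and every site entry looks like 'id:number'.

```python
def headers_to_dict(headers):
    """Turn ['PRB_CARD: 108', ...] into {'PRB_CARD': '108', ...}."""
    result = {}
    for line in headers:
        # Split on the first ':' only; values are kept as strings.
        key, _, value = line.partition(':')
        result[key.strip()] = value.strip()
    return result


def site_to_dict(entries):
    """Turn ['1000:1665.67', ...] into {1000: 1665.67, ...}."""
    result = {}
    for entry in entries:
        key, _, value = entry.partition(':')
        # Assumption: the id is an integer and the value parses as a float.
        result[int(key)] = float(value)
    return result


# The lists below are the ones printed in the sample output.
headers = ['PRB_CARD: 108', 'PRB_FACILITY: LEHI']
site0 = ['1000:1665.67', '1007:12.0638', '1011:27.728']

print(headers_to_dict(headers))  # {'PRB_CARD': '108', 'PRB_FACILITY': 'LEHI'}
print(site_to_dict(site0))       # {1000: 1665.67, 1007: 12.0638, 1011: 27.728}
```

With the class above, the same helpers could be applied to p.get_headers() and p.get_site_data(i). Note that a later duplicate id within one site would overwrite the earlier value, which a list does not do.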