
I have log files of the form:

SITE_COUNT: 11
PRB_CARD: 108
PRB_FACILITY: LEHI
PROCESS_ID: 88AFX
TEMP: 0
DATA_SET_ID: _P9kTbjdptOyonKO_
START_DATETIME: 05/01/2020 03:06:24
LOT: 0522072.0
.
.
.
+ 1 1588323984 1NA:0NN
{
Head(1) Site(0) (X,Y)=(-4,16)
VALID_COUNT 712

*Z
SITE:PARAM_SITE:N3P3
SITE:PROCESS_DURATION:81
1000:1665.67:VALID
.
.
1007:12.0638:VALID
1011:27.728:VALID
.
.
NUM_REGISTERS 712
NUM_VALID 6787
NUM_TESTED 6787
}
.
.
.
+ 2 1585959359 1NA:0NN
{
Head(1) Site(0) (X,Y)=(-2,4)
VALID_COUNT 583

*Z
SITE:PARAM_SITE:N2N3
SITE:PROCESS_DURATION:286
1003:10.0677:VALID
.
.
.
FINISH_ETIME: 1588324881

As you can see from the sample, the file starts with a section of headers such as PRB_CARD and PRB_FACILITY. These headers are typically in the first 50 lines of each file, so I have a list comprehension that captures only the first 50 lines of the file and feeds them into dictionaries, from which I extract the key-value pairs I need from the header section.

My issue now is the lines under each Head(x) Site(x) section. Each Head(x) section has many lines, often around 800. I need to capture each section, put it in a table, and have my script move on to the next section and capture that one as well. Each Head(x) section needs to be captured separately.

How can I do this?

edo101

3 Answers


You could try a regex split:

import re

headSections = re.split(r"^Head\(\d+\) Site\(\d+\).*$", log, flags=re.MULTILINE)

This will create a list of text sections divided by these Head lines, while removing the Head lines. If you want to save the Head lines as well, you can put the whole regex in parentheses.

Or if the files are too large to read into memory all at once, you can use the regex above to check whether you’ve entered a new section:

sections = []
current = []
for line in logFile:
  if re.match(r"Head\(\d+\) Site\(\d+\)", line):
    # start new section
    sections.append(current)
    current = [line]
  else:
    # add to existing section
    current.append(line)
sections.append(current)

Regex demo
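A self-contained sketch of the first approach, using the capturing-group variant that keeps the Head lines (the miniature log string here is made up, trimmed down from the question's format):

```python
import re

# Made-up miniature log in the question's format
log = """SITE_COUNT: 2
Head(1) Site(0) (X,Y)=(-4,16)
1000:1665.67:VALID
Head(1) Site(1) (X,Y)=(-2,4)
1003:10.0677:VALID
"""

# Parentheses around the whole pattern make re.split keep the Head lines
parts = re.split(r"(^Head\(\d+\) Site\(\d+\).*$)", log, flags=re.MULTILINE)

# parts[0] is everything before the first Head line (the header block);
# after that, Head lines and their section bodies alternate
sections = dict(zip(parts[1::2], parts[2::2]))
```

Because the captured Head lines alternate with the section bodies in the result, pairing `parts[1::2]` with `parts[2::2]` gives you one entry per Head(x) Site(x) section.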

jdaz
  • So I can do that for a list that contains the lines under one Head(x) section. My issue is, how can I reset the list to load the next set of data from the next Head(x) section? I want to load each set of lines under each Head(x) section separately, hopefully with a list that resets itself after it hits the next Head(x) section @jdaz – edo101 Jul 24 '20 at 17:24
  • What do you mean by "reset the list"? Have you tried my first example with `split` above? It should work for you if I'm understanding what you're looking for correctly – jdaz Jul 24 '20 at 17:44
  • So sorry, I have been in meetings for the past 5 hours. I haven't tried it yet... though I am often hesitant to use regex because of how slow it is in Python – edo101 Jul 24 '20 at 23:11
  • Any way to do that re.match thing in a list comprehension? – edo101 Jul 24 '20 at 23:17
  • What would you want the comprehension to do? There is probably a way to split the sections with a comprehension, but if that is all you need you are better off with `re.split`. – jdaz Jul 26 '20 at 02:11

There are a few approaches for lazily reading a file. You could use a seek operation to move the current file position backwards. I prefer avoiding extra OS calls, so I'd take an approach similar to this post: create a lazy iterator (a generator) that keeps a buffer of the lines read, wrapping the file so that it reads one section at a time. Lazy here means that when yield is reached, the function stops reading the file until it is called again.

def read_section(f):
    buffer = []
    while True:
        line = f.readline()
        if line == "":
            # EOF
            break
        # A stricter regex match would be more robust here
        if "Head" in line and "Site" in line:
            # yield the current buffer and start packing the next one
            yield buffer
            buffer = [line]
        else:
            buffer.append(line)
    # at end of file, yield the last buffer
    yield buffer
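A self-contained usage sketch of this generator, feeding it a made-up in-memory log via io.StringIO instead of a real file:

```python
import io

def read_section(f):
    # Generator as above: yields one buffered section per Head line
    buffer = []
    while True:
        line = f.readline()
        if line == "":  # EOF
            break
        if "Head" in line and "Site" in line:
            yield buffer          # hand back the finished section
            buffer = [line]       # start the next one with its Head line
        else:
            buffer.append(line)
    yield buffer  # last section at end of file

sample = io.StringIO(
    "SITE_COUNT: 2\n"
    "Head(1) Site(0) (X,Y)=(-4,16)\n"
    "1000:1665.67:VALID\n"
    "Head(1) Site(1) (X,Y)=(-2,4)\n"
    "1003:10.0677:VALID\n"
)
sections = list(read_section(sample))
# sections[0] is the header block before the first Head line;
# every later element begins with its own Head line
```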

The other option I mentioned is using seek. Apparently, in Python 3 you can only seek relative to the start of the file, so I use tell to find the current position.

def read_section(f):
    buffer = []
    line = f.readline()
    while line != "":
        # A non-empty buffer means this is not the first header we encounter
        if "Head" in line and "Site" in line and buffer:
            # Unread the last line (assuming utf-8 encoding)
            f.seek(f.tell() - len(line.encode('utf-8')))
            return buffer
        buffer.append(line)
        line = f.readline()
    return buffer
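A similar self-contained sketch of the seek-based variant against an in-memory ASCII log (io.StringIO's tell reports character offsets, so plain len(line) works for the seek arithmetic there):

```python
import io

def read_section(f):
    # Seek-based variant: returns one section per call
    buffer = []
    line = f.readline()
    while line != "":
        # A non-empty buffer means this Head line starts the *next* section
        if "Head" in line and "Site" in line and buffer:
            # "Unread" the Head line so the next call sees it again
            f.seek(f.tell() - len(line))
            return buffer
        buffer.append(line)
        line = f.readline()
    return buffer

sample = io.StringIO(
    "Head(1) Site(0) (X,Y)=(-4,16)\n"
    "1000:1665.67:VALID\n"
    "Head(1) Site(1) (X,Y)=(-2,4)\n"
    "1003:10.0677:VALID\n"
)
first = read_section(sample)   # section up to (not including) the next Head line
second = read_section(sample)  # picks up at the Head line that was "unread"
```

Note that on a real file opened in text mode, seek only reliably accepts offsets returned by tell, so the byte arithmetic above is safest with binary mode or pure ASCII data.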
Eytan

UPDATE: Here is a re-engineered version of the original idea using a class and a somewhat different (and I hope cleaner) approach to parsing the data.

Assumptions:

  1. The data starts with a line like 'SITE_COUNT: nnn' where nnn is the number of site sections.
  2. Each site section is preceded by a line starting with '{' and is followed by a line starting with '}'.
  3. Header and site data is stored in arrays. It should be easy to change the code to store and return the data in other data types such as dict's.

The call Parse(f) in the code below creates a Parse object which parses the entire stream f and exposes methods to obtain the header and the (multiple) site data captured. (See the last few lines of the code).

For the sample input data below, the output is:

Site count is 2
Processing site 0
Processing site 1
Finished processing.
The headers are ['PRB_CARD: 108', 'PRB_FACILITY: LEHI']
Site 0 data is ['1000:1665.67', '1007:12.0638', '1011:27.728']
Site 1 data is ['1003:10.0677']

Code:

import re

# Define a class
class Parse():
    def __init__(self, f):
        self.f = f
        # Use instance attributes so each Parse object gets its own lists
        self.headers = []
        self.site_data = []
        # Capture the number of sites in the file.
        self.readline()
        m = re.match(r'SITE_COUNT: (\d+)', self.line)
        if m:
            self.site_count = int(m.group(1))
            print('Site count is', self.site_count)
        else:
            raise Exception('Invalid input file format')

        self.headers = self.capture_headers()

        for i in range(self.site_count):
            print('Processing site', i)
            self.site_data.append(self.capture_site())
        print('Finished processing.')

    def capture_headers(self):
        headers = []
        while self.readline():
            if self.line.startswith('{'):
                break
            if self.line.startswith('PRB_'):
                headers.append(self.line)
        return headers

    def capture_site(self):
        pat = re.compile(r'(\d+:\d*\.\d*):VALID')
        data = []
        while self.readline():
            if self.line.startswith('}'):
                break
            m = pat.match(self.line)
            if m:
                data.append(m.group(1))
        return data

    def get_headers(self):
        return self.headers

    def get_site_count(self):
        return self.site_count

    def get_site_data(self, i):
        return self.site_data[i]

    def readline(self):
        self.line = self.f.readline().rstrip('\n\r')
        return not self.line.startswith('FINISH_ETIME:') # Returns False at the end

# Run against the (slightly modified) data:
f = open('data.log')
p = Parse(f)
print('The headers are', p.get_headers())
for i in range(p.get_site_count()):
    print('Site', i, 'data is', p.get_site_data(i))

And the input is: (note the site count in the first line!)

SITE_COUNT: 2
PRB_CARD: 108
PRB_FACILITY: LEHI
PROCESS_ID: 88AFX
TEMP: 0
DATA_SET_ID: _P9kTbjdptOyonKO_
START_DATETIME: 05/01/2020 03:06:24
LOT: 0522072.0
.
.
.
+ 1 1588323984 1NA:0NN
{
Head(1) Site(0) (X,Y)=(-4,16)
VALID_COUNT 712

*Z
SITE:PARAM_SITE:N3P3
SITE:PROCESS_DURATION:81
1000:1665.67:VALID
.
.
1007:12.0638:VALID
1011:27.728:VALID
.
.
NUM_REGISTERS 712
NUM_VALID 6787
NUM_TESTED 6787
}
.
.
.
+ 2 1585959359 1NA:0NN
{
Head(1) Site(1) (X,Y)=(-2,4)
VALID_COUNT 583

*Z
SITE:PARAM_SITE:N2N3
SITE:PROCESS_DURATION:286
1003:10.0677:VALID
.
.
.
}
FINISH_ETIME: 1588324881
C. Pappy
  • I think this isolates the logic of parsing the header from the logic of parsing the site data which, I think, is what your mentor had in mind ;) – C. Pappy Jul 24 '20 at 18:46
  • Wait, which function captures only the middle part? By middle part I mean the part that has all the VALID sections – edo101 Jul 25 '20 at 00:49
  • Btw, the file itself is a zip file, so its contents are in binary. I have also already captured the headers I need. I only need the data under the site sections. Just the lines with VALID in them for each section – edo101 Jul 25 '20 at 01:33
  • The code above illustrates how you could detect where each section begins and ends. It uses the fact that each site section starts with '{' and ends with '}'. So capture_headers(f) reads lines until it encounters the first {, which signifies the end of the headers. What happens in capture_headers is, of course, up to the application and, yes, I realize that you have already taken care of that task, but I wanted to show runnable code. to be continued... – C. Pappy Jul 25 '20 at 02:17
  • ... Once the headers are processed, capture_site(f) is called between the next {, which signifies the beginning of a new site section, and the next }, which ends that section. As I understand your requirements, all (or some) of the lines in the section have to be collected and saved somehow. If you need help with exactly how to accomplish that, let me know. Function capture_site is called for each site section until all sections are processed – C. Pappy Jul 25 '20 at 02:19
  • I'll hold on to your solution for now as I haven't had time to read it. This might be more apt for bigger files. I talked with my mentor, and since the files are typically 500KB at most, I have decided to load the lines into a list. I then use some logic with tuples to find the ranges of these headers and other sections and process them. I'll come back to this when I am not at a project deadline – edo101 Jul 27 '20 at 13:57
  • Thanks for all the effort so far – edo101 Jul 27 '20 at 13:57