extract critical numbers from a mixed log file

Question

I have a log file containing many slices like this:

Align set A and merge into set B ...
    setA, 4 images , image size 146 X 131
    setA, image 1, shape center shift (7, -9) compared to image center
    setA, image 2, shape center shift (8, -10) compared to image center
    setA, image 3, shape center shift (6, -9) compared to image center
    setA, image 4, shape center shift (6, -8) compared to image center
    final set B, image size 143 X 129
Write set B ...

Now, I want to extract the numbers in this slice into a table:

|    | width_A | height_A | shift_x | shift_y | width_B | height_B |
| --- | --- | --- | --- | --- | --- | --- |
| A1 | 146 | 131 | 7 | -9 | 143 | 129 |
| A2 | 146 | 131 | 8 | -10 | 143 | 129 |
| A3 | 146 | 131 | 6 | -9 | 143 | 129 |
| A4 | 146 | 131 | 6 | -8 | 143 | 129 |

The procedure could be divided into two parts:

1. Text processing: read the text into a dictionary, e.g., data['A1']['shift_x'] = 7.
2. Use pandas to convert the dictionary into a DataFrame: df = pd.DataFrame(data)

But I am not familiar with text processing in Python:

- Unlike in "Python: How to loop through blocks of lines", my log text is not so well organised;
- regular expressions may be an option, but I can never remember the tricks for matching all the different kinds of symbols.

Does anyone have a good solution for this? Python is preferred. Thanks in advance.

score 0 · Accepted Answer · edited May 23 '17 at 12:22

I finally found an answer myself:

import re

log_file = 'merge.log'  # placeholder path; point this at the actual log file

# Map each tuple of attribute names to the pattern whose groups capture them
regexp = {
    ('title', ): re.compile(r'Merge (.*) into set B.*\n'),
    ('nimages', 'width_A', 'height_A'): re.compile(r'\s+setA, (\d+) images , image size (\d+) X (\d+).*\n'),
    ('image_no', 'shift_x', 'shift_y'): re.compile(r'\s+setA, image (\d+), shape center shift \((-?\d+), (-?\d+)\) compared to image center.*\n'),
    ('gauge_no', ): re.compile(r'Write gauge (\d+), set B.*') }

with open(log_file) as f:
    for line in f:
        print(line, end='')
        for keys, pattern in regexp.items():
            m = pattern.match(line)
            if m:
                # traverse attributes
                for groupn, attr in enumerate(keys):
                    # m.group(0) is the whole match, so captured groups start at 1
                    print(str(groupn) + ' ' + attr + ' ' + m.group(groupn + 1))
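
The loop above only prints what it matches. As a rough sketch of how the same idea could fill the dictionary described in the question, the following works on the sample slice from the question only. The names SAMPLE, SIZE_A, SHIFT, SIZE_B and parse_slices are made up for illustration, and the row labels ('A1', 'A2', ...) are assumed to be unique; a log with many slices would need a slice identifier in the key as well.

```python
import re

# Sample slice copied from the question.
SAMPLE = """\
Align set A and merge into set B ...
    setA, 4 images , image size 146 X 131
    setA, image 1, shape center shift (7, -9) compared to image center
    setA, image 2, shape center shift (8, -10) compared to image center
    setA, image 3, shape center shift (6, -9) compared to image center
    setA, image 4, shape center shift (6, -8) compared to image center
    final set B, image size 143 X 129
Write set B ...
"""

# Illustrative patterns, written against the sample above (assumption:
# the first number of "W X H" is the width, as in the question's table).
SIZE_A = re.compile(r"\s*setA, (\d+) images , image size (\d+) X (\d+)")
SHIFT = re.compile(r"\s*setA, image (\d+), shape center shift \((-?\d+), (-?\d+)\)")
SIZE_B = re.compile(r"\s*final set B, image size (\d+) X (\d+)")


def parse_slices(lines):
    """Return {'A1': {'width_A': ..., 'shift_x': ..., ...}, ...}."""
    data = {}
    size_a = None   # (width, height) of set A in the current slice
    pending = []    # row labels still waiting for the set B size
    for line in lines:
        m = SIZE_A.match(line)
        if m:
            size_a = (int(m.group(2)), int(m.group(3)))
            pending = []
            continue
        m = SHIFT.match(line)
        if m and size_a is not None:
            label = "A" + m.group(1)
            data[label] = {
                "width_A": size_a[0],
                "height_A": size_a[1],
                "shift_x": int(m.group(2)),
                "shift_y": int(m.group(3)),
            }
            pending.append(label)
            continue
        m = SIZE_B.match(line)
        if m:
            # The set B size closes the slice; copy it onto every row of it.
            for label in pending:
                data[label]["width_B"] = int(m.group(1))
                data[label]["height_B"] = int(m.group(2))
            pending = []
    return data


data = parse_slices(SAMPLE.splitlines())
```

With a nested dictionary like this, pd.DataFrame(data) would put the labels in the columns; pd.DataFrame.from_dict(data, orient='index') gives one row per label, which is the layout of the table in the question.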

Reference

I didn't notice this question before I asked: Extracting info from large structured text files
Regular expression cheat table
