
My goal is to lift a few values from a text file and generate a plot using matplotlib...

I have several large (~100MB) text log files generated from a python script that is calling tensorflow. I save the terminal output from running the script like this:

python my_script.py 2>&1 | tee mylog.txt

Here's a snippet from the text file that I'm trying to parse and turn into a dictionary:

Epoch 00001: saving model to /root/data-cache/data/tmp/models/ota-cfo-full_20200626-173916_01_0.05056382_0.99.h5

5938/5938 [==============================] - 4312s 726ms/step - loss: 0.1190 - accuracy: 0.9583 - val_loss: 0.0506 - val_accuracy: 0.9854

I'm specifically trying to lift the epoch number (0001), the time in seconds (4312), loss (0.1190), accuracy (0.9583), val_loss (0.0506), and val_accuracy (0.9854) for 100 epochs so I can make a plot using matplotlib.

The log file is full of other junk that I don't want like:

Epoch 1/100

   1/5938 [..............................] - ETA: 0s - loss: 1.7893 - accuracy: 0.31252020-06-26 17:39:45.253972: I tensorflow/core/profiler/internal/gpu/cupti_tracer.cc:1479] CUPTI activity buffer flushed
2020-06-26 17:39:45.255588: I tensorflow/core/profiler/internal/gpu/device_tracer.cc:216]  GpuTracer has collected 179 callback api events and 179 activity events.
2020-06-26 17:39:45.276306: I tensorflow/core/profiler/rpc/client/save_profile.cc:168] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45
2020-06-26 17:39:45.284235: I tensorflow/core/profiler/rpc/client/save_profile.cc:174] Dumped gzipped tool data for trace.json.gz to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.trace.json.gz
2020-06-26 17:39:45.286639: I tensorflow/core/profiler/utils/event_span.cc:288] Generation of step-events took 0.049 ms

2020-06-26 17:39:45.288257: I tensorflow/python/profiler/internal/profiler_wrapper.cc:87] Creating directory: /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45Dumped tool data for overview_page.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.overview_page.pb
Dumped tool data for input_pipeline.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.input_pipeline.pb
Dumped tool data for tensorflow_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.tensorflow_stats.pb
Dumped tool data for kernel_stats.pb to /root/data-cache/data/tmp/20200626-173933/train/plugins/profile/2020_06_26_17_39_45/ddfc870f32d1.kernel_stats.pb


   2/5938 [..............................] - ETA: 6:24 - loss: 1.7824 - accuracy: 0.2656
   3/5938 [..............................] - ETA: 17:03 - loss: 1.7562 - accuracy: 0.2396
   4/5938 [..............................] - ETA: 22:27 - loss: 1.7368 - accuracy: 0.2344
   5/5938 [..............................] - ETA: 22:55 - loss: 1.7387 - accuracy: 0.2375
   6/5938 [..............................] - ETA: 24:16 - loss: 1.7175 - accuracy: 0.2656
   7/5938 [..............................] - ETA: 24:34 - loss: 1.6885 - accuracy: 0.2812
Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.

This almost works but doesn't get the epoch number:

import re

regular_exp = re.compile(r'(?P<loss>\d+\.\d+)\s+-\s+accuracy:\s*(?P<accuracy>\d+\.\d+)\s+-\s+val_loss:\s*(?P<val_loss>\d+\.\d+)\s*-\s*val_accuracy:\s*(?P<val_accuracy>\d+\.\d+)', re.M)
with open(log_file, 'r') as file:
    results = [match.groupdict() for match in regular_exp.finditer(file.read())]

I've also tried just reading the file in, but it's full of these weird \x08 (backspace) characters everywhere.

from pprint import pprint as pp

log_file = 'mylog.txt'
with open(log_file, 'r') as text_file:
    lines = text_file.readlines()
pp(lines)
' 291/1500 [====>.........................] - ETA: 12:19 - loss: 0.7179 - '
 'accuracy: '
 '0.7163\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',
 ' 292/1500 [====>.........................] - ETA: 12:18 - loss: 0.7164 - '
 'accuracy: '
 '0.7168\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\x08\n',

Can someone help me construct a regular expression in python that will let me make a dictionary of these values?

My goal is something like this:

[{'iteration': '00', 'seconds': '1802', 'loss': '0.3430', 'accuracy': '0.8753', 'val_loss': '0.1110', 'val_accuracy': '0.9670', 'epoch_num': '00002', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_02_0.069291627_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1679', 'loss': '0.0849', 'accuracy': '0.9739', 'val_loss': '0.0693', 'val_accuracy': '0.9807', 'epoch_num': '00003', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_03_0.055876694_0.98.h5'},
 {'iteration': '1500/1500', 'seconds': '1674', 'loss': '0.0742', 'accuracy': '0.9791', 'val_loss': '0.0559', 'val_accuracy': '0.9845', 'epoch_num': '00004', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_04_0.053867317_0.99.h5'},
 {'iteration': '1500/1500', 'seconds': '1671', 'loss': '0.0565', 'accuracy': '0.9841', 'val_loss': '0.0539', 'val_accuracy': '0.9850', 'epoch_num': '00005', 'epoch_file': '/root/data-cache/data/tmp/models/ota-cfo-10k_20200527-001913_05_0.053266536_0.99.h5'}]
random_dsp_guy
  • What are you trying to get? What's your expected output? – theX Jul 01 '20 at 22:34
  • A list of dictionaries with key value pairs of epoch, seconds, loss, accuracy, val_loss, val_accuracy, model file... – random_dsp_guy Jul 01 '20 at 22:37
  • Hmm, I've done this kind of text-to-python-dict conversion too. Use dictionary comprehensions (https://stackoverflow.com/questions/14507591/python-dictionary-comprehension) with regexes to help extract the data. – theX Jul 01 '20 at 22:41
  • Need a bit more information on the exact format, could you post an entire epoch output (with most of the stuff in the middle cut out), leaving the start and end intact? – Chase Jul 02 '20 at 15:37
  • I added epoch 54 and 55 above – random_dsp_guy Jul 02 '20 at 15:58

1 Answer


You can achieve this using a combination of the right regex, a list comprehension, groupdict, and finditer.

First things first - we need a baseline, standardized text format. This is important: if your text content does not match this, try replacing all \x08 bytes (and any other stray bytes, for that matter) with a space. (\x08 just means backspace.)
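For example, a minimal cleanup sketch (the file name mylog.txt is taken from your question, and replacing with a space is just one option):

with open('mylog.txt', 'r') as f:
    raw = f.read()
# \x08 is the backspace byte emitted by the progress bar - swap it for a space
data = raw.replace('\x08', ' ')

With the content standardized, here's the baseline text I'll be working with -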

data = """Epoch 54/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0088 - accuracy: 0.9984       
Epoch 00054: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_54_0.0054215402_1.00.h5
1500/1500 [==============================] - 942s 628ms/step - loss: 0.0088 - accuracy: 0.9984 - val_loss: 0.0054 - val_accuracy: 0.9993
2020-07-02 14:42:29.102025: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 674 of 1000
2020-07-02 14:42:33.511163: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled.
Epoch 55/100
1500/1500 [==============================] - ETA: 0s - loss: 0.0136 - accuracy: 0.9978      
Epoch 00055: saving model to /root/data-cache/data/tmp/models/ota-cfo-10k-clean_20200701-205945_55_0.0036424326_1.00.h5
1500/1500 [==============================] - 948s 632ms/step - loss: 0.0136 - accuracy: 0.9978 - val_loss: 0.0036 - val_accuracy: 0.9990
2020-07-02 14:58:32.042963: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:184] Filling up shuffle buffer (this may take a while): 690 of 1000
2020-07-02 14:58:36.302518: I tensorflow/core/kernels/data/shuffle_dataset_op.cc:233] Shuffle buffer filled."""

That's a one-to-one replication of the latest example you provided. It seemed the most complete, so I'll be using it.

The regex you need should be -

ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)


Now you just need a single line to work all the magic -

info_list = [match.groupdict() for match in re.finditer(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)', data)]

I definitely recommend compiling the pattern first though, since you'll be running it over large (~100MB) log files.

import re

DATA_PATTERN = re.compile(r'ETA: (?P<ETA>[\d\.]+)s - loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+)\s+Epoch (?P<iteration>\d+)')
info_list = [match.groupdict() for match in DATA_PATTERN.finditer(data)]

Output-

[{'ETA': '0', 'loss': '0.0088', 'accuracy': '0.9984', 'iteration': '00054'},
 {'ETA': '0', 'loss': '0.0136', 'accuracy': '0.9978', 'iteration': '00055'}]
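
If you also want the step time in seconds, val_loss, val_accuracy, and the saved model path from your target output, a rough sketch along the same lines (assuming the \x08 bytes have already been stripped as described above, and that every epoch prints the "saving model" line immediately followed by the summary line, as in the sample) could be -

import re
import matplotlib.pyplot as plt

FULL_PATTERN = re.compile(
    r'Epoch (?P<epoch_num>\d+): saving model to (?P<epoch_file>\S+)\s+'
    r'(?P<iteration>\d+/\d+) \[=+\] - (?P<seconds>\d+)s \d+ms/step - '
    r'loss: (?P<loss>[\d\.]+) - accuracy: (?P<accuracy>[\d\.]+) - '
    r'val_loss: (?P<val_loss>[\d\.]+) - val_accuracy: (?P<val_accuracy>[\d\.]+)')

results = [m.groupdict() for m in FULL_PATTERN.finditer(data)]

# each dict now carries the keys from your target output, so plotting is one line per curve
epochs = [int(r['epoch_num']) for r in results]
plt.plot(epochs, [float(r['loss']) for r in results], label='loss')
plt.plot(epochs, [float(r['val_loss']) for r in results], label='val_loss')
plt.xlabel('epoch')
plt.legend()
plt.show()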
Chase
    Good answer. A fun trick I use a lot with painful regexes is to make use of the "extended" flag `re.X`. This lets you make something like this: https://gitlab.com/snippets/1992154 – Adam Smith Jul 02 '20 at 16:51
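
For illustration, the compiled pattern above could be rewritten with the re.X flag, which ignores unescaped whitespace in the pattern and allows inline comments - roughly:

import re

DATA_PATTERN = re.compile(r"""
    ETA:\ (?P<ETA>[\d\.]+)s                # "ETA: 0s"
    \ -\ loss:\ (?P<loss>[\d\.]+)          # " - loss: 0.0088"
    \ -\ accuracy:\ (?P<accuracy>[\d\.]+)  # " - accuracy: 0.9984"
    \s+Epoch\ (?P<iteration>\d+)           # next line: "Epoch 00054: ..."
""", re.X)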