Read a text file line by line and store variables on matching specific pattern in Python

Question

We have a large log file containing lines like the following:

00 LOG     |   Cycles Run:  120001
00 LOG     ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).

00 LOG     ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

00 LOG     ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).

00 LOG     ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

As the log file is large, we want to read it line by line (rather than reading the whole text into a buffer at once), match a specific set of patterns, and store the matched values in separate variables.

e.g.

00 LOG     |   Cycles Run:  120001

We want to pick 120001 and store it in a variable, say cycle.

We also want to parse these lines:

00 LOG     ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).

00 LOG     ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

00 LOG     ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).

00 LOG     ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

Characters marked with ? can be any digit.

We want to store the values in variables as follows:

640733184 in var virtual_cur

1082470400 in var virtual_max

472154112 in var actual_cur

861736960 in var actual_max

I wrote a snippet in Python 3.6, but it is printing an empty list:

import re

filename = "test.txt"
with open(filename) as fp:  
   line = fp.readline()
   while line:
       cycle_num = re.findall(r'00 LOG     |   Cycles Run:  (.*?)',line,re.DOTALL)
       line = fp.readline()

print (cycle_num[0])

NOTE: I want to pick each value into a separate variable and use it later on. I need to set 5 patterns one by one, pick the value if a line matches a specific pattern, and put it in the respective variable.

Not sure about the wildcard matching for the second pattern.

Please suggest a way to do this efficiently.

Yes. For the second variable I'm not sure how to parse it out using pattern matching with wildcards, so I did not include it in the snippet. – Foobar-naut Aug 29 '18 at 19:50
Note that `|` is a regex metacharacter for alternation. Your example regex of `r'00 LOG | Cycles Run: (.*?)'` has the issue of looking for `00 LOG ` OR ` Cycles Run: (.*?)`, which is why it is not matching what you expect. – dawg Aug 29 '18 at 20:23
Since you have `Max>` and `Current>` in the same line, how are you deciding which one is the target to capture? – dawg Aug 30 '18 at 14:11
We need to search for 4 different patterns. There are two lines each for `Virtual` and `Actual`, and every one of them contains both `Max>` and `Current>`. We need different pattern matching to select the first or the last value, as mentioned. – Foobar-naut Aug 30 '18 at 14:36
[This](https://regex101.com/r/L49460/1/) works for your example, but you are not clearly stating how to differentiate these values. – dawg Aug 30 '18 at 15:59
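
To illustrate the comment about `|` above, here is a minimal sketch. It assumes the log line looks like the sample in the question; the variable names `broken` and `fixed` are only for illustration.

```python
import re

line = "00 LOG     |   Cycles Run:  120001"

# Unescaped '|' splits the pattern into two alternatives, and the lazy
# group (.*?) is happy to match nothing, so the number is never captured.
broken = re.findall(r'00 LOG     |   Cycles Run:  (.*?)', line)

# Escaping the pipe and capturing digits explicitly picks up the number.
# \s+ is used so the exact number of spaces does not matter.
fixed = re.findall(r'00 LOG\s+\|\s+Cycles Run:\s+(\d+)', line)

print(broken, fixed)
```

Using `(\d+)` rather than `(.*?)` matters as much as the escaping: a lazy group at the very end of a pattern can always succeed with an empty match.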

score 2 · Accepted Answer · answered Aug 29 '18 at 20:13

With the regex

(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)

Demo

You can do something along these lines:

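import re

# Match the digits that follow either "Cycles Run:" or "Current>".
pat = re.compile(r'(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)')
with open('test.txt', 'r') as f:
    for line_num, line in enumerate(f):  # reads one line at a time
        m = pat.search(line)
        if m:
            # group(0) is the whole match (label plus number); group(1) is the number only
            print(line_num, m.group(0))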

answered Aug 29 '18 at 20:13

dawg

I used group(0) just because it was unclear if the OP wanted to read that value into a dict or what. It identifies which value was captured. – dawg Aug 30 '18 at 00:25
Thanks a lot! Here you have matched the pattern using `Current>`. What if we have patterns like this: `00 LOG ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).` `00 LOG ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).` `00 LOG ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).` `00 LOG ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).` We need to pick up these four values in four different variables. – Foobar-naut Aug 30 '18 at 05:58
Please update your question with those strings. It is unclear in a comment what you are looking for. – dawg Aug 30 '18 at 09:25
Updated the original question! :) – Foobar-naut Aug 30 '18 at 11:30
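
For the updated question (five values in five variables), one possible approach is sketched here. It is not taken from any of the answers: the names `CYCLE_RE`, `MEM_RE`, and `parse_log` are hypothetical, and it assumes that every `Virtual:`/`Actual:` line carries real digits in both the `Max>` and `Current>` fields, so the last such line seen wins. The gb figures in the sample lines are simply the byte counts divided by 2**30.

```python
import re

# Hypothetical patterns, compiled once; whitespace is matched loosely.
CYCLE_RE = re.compile(r'Cycles Run:\s+(\d+)')
MEM_RE = re.compile(
    r'(Virtual|Actual):\s+Max>\s*(\d+)\s+bytes.*?Current>\s*(\d+)\s+bytes'
)

def parse_log(lines):
    """Return a dict holding the last value seen for each field."""
    values = {}
    for line in lines:  # works on an open file object, one line at a time
        m = CYCLE_RE.search(line)
        if m:
            values['cycle'] = int(m.group(1))
            continue
        m = MEM_RE.search(line)
        if m:
            kind = m.group(1).lower()  # 'virtual' or 'actual'
            values[kind + '_max'] = int(m.group(2))
            values[kind + '_cur'] = int(m.group(3))
    return values

# Sample lines standing in for the real log file (assumed format).
sample = [
    "00 LOG     |   Cycles Run:  120001\n",
    "00 LOG     ! Virtual: Max> 1082470400 bytes (1.0081 gb), Current> 640733184 bytes (0.5967 gb).\n",
    "00 LOG     ! Actual: Max> 861736960 bytes (0.8026 gb), Current> 472154112 bytes (0.4397 gb).\n",
]
result = parse_log(sample)
print(result)
```

With a real file, passing the open file object (`with open('test.txt') as f: values = parse_log(f)`) keeps the line-by-line reading the question asks for. If only certain lines should be trusted for `Max>` or `Current>`, that rule would have to be added, since the question does not spell it out.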

score 1 · Answer 2 · answered Aug 29 '18 at 19:57

You may use an alternation here with two lookbehinds:

(?<=Cycles Run:  )\d+|(?<= Current>  )\d+

Regex demo here.

Python example:

import re
text = '''
00 LOG     |   Cycles Run:  120001
00 LOG     !   Virtual: Max> 1082470400 bytes (1.0081 gb), Current>  640733184 bytes (0.5967 gb)
'''

pattern = re.compile(r'(?<=Cycles Run:  )\d+|(?<= Current>  )\d+')
matches = re.findall(pattern,text)
num_cycle = matches[0]
current = matches[1]

print(num_cycle,current)

Prints:

120001 640733184

As you are repeating the process in a loop, it is recommended to use re.compile to compile the pattern only once before the loop.
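
As a rough sketch of that advice, the same pattern can be compiled once and applied while iterating line by line. Here `io.StringIO` stands in for the real log file so the snippet is self-contained, and the list name `found` is just for illustration.

```python
import io
import re

# Compiled once, outside the loop. Lookbehinds in Python's re module must
# be fixed-width, so the two spaces after each label are assumed to be exact.
pattern = re.compile(r'(?<=Cycles Run:  )\d+|(?<= Current>  )\d+')

# Stand-in for open('test.txt'); iterating yields one line at a time.
log = io.StringIO(
    "00 LOG     |   Cycles Run:  120001\n"
    "00 LOG     !   Virtual: Max> 1082470400 bytes (1.0081 gb), Current>  640733184 bytes (0.5967 gb)\n"
)

found = []
for line in log:
    m = pattern.search(line)
    if m:
        found.append(m.group(0))  # the lookbehind text is not part of the match

print(found)
```

Because the lookbehinds are fixed-width, a log line with a different number of spaces after `Current>` would not match; a capturing group with `\s+` is the more forgiving option in that case.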

score 0 · Answer 3 · answered Aug 29 '18 at 20:01

Here we search for an identifier (such as `Cycles`) and apply a different regex depending on whether the line contains it:

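import re

with open('test.txt', 'r') as f:
    for line in f:
        if re.search(r'Cycles', line):
            # Cycles line: the number is the run of digits at the end of the line
            m = re.findall(r'\d+$', line)
        else:
            # Any other line: take the number that follows "Current>"
            m = re.findall(r'Current>  (\d+)', line)
        print(m)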

answered Aug 29 '18 at 20:01

TheMaster

This works but seems rather inefficient. The patterns should at least be compiled before the loop. – Paolo Aug 29 '18 at 20:07
@Unbearable The doc states `the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.` Since it's the same expression, I think caching will help. Also, the regex is really simple, without lookaheads/lookbehinds, which will save lots of time. This is obviously not elegant. – TheMaster Aug 29 '18 at 20:19
You are right, it does appear the patterns are [cached](https://stackoverflow.com/a/452143/3390419), my bad. Yes, the lookarounds can be expensive; however, only one `re` function is needed when using them. – Paolo Aug 29 '18 at 20:23
