1

We have a large log file containing lines like the following:

00 LOG     |   Cycles Run:  120001
00 LOG     ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).

00 LOG     ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

00 LOG     ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).

00 LOG     ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

Because the log file is large, we want to read it line by line (rather than loading the whole text into a buffer at once), match a specific set of patterns, and store the matched values in separate variables.

e.g.

00 LOG     |   Cycles Run:  120001

We want to pick 120001 and store it in a variable, say cycle.

We also need to parse these lines:

00 LOG     ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).

00 LOG     ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

00 LOG     ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).

00 LOG     ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).

Characters marked with ? can be any digit.

We want to store the values in variables as follows:

640733184 in var virtual_cur

1082470400 in var virtual_max

472154112 in var actual_cur

861736960 in var actual_max

I wrote a snippet in Python 3.6, but it prints an empty list:

import re

filename = "test.txt"
with open(filename) as fp:  
   line = fp.readline()
   while line:
       cycle_num = re.findall(r'00 LOG     |   Cycles Run:  (.*?)',line,re.DOTALL)
       line = fp.readline()

print (cycle_num[0])

NOTE: I want to store each value in a separate variable and use it later on. I need to try the 5 patterns one by one, pick the value when a line matches a specific pattern, and put it in the respective variable.

Not sure about the wildcard matching for the second pattern.

Please suggest a way to do this efficiently.

Foobar-naut
  • Do you want two variables, one for each value? – Paolo Aug 29 '18 at 19:39
  • Yes. For the second variable I'm not sure how to parse it out using pattern matching with wildcards, so I did not include it in the snippet. – Foobar-naut Aug 29 '18 at 19:50
  • Note that `|` is a regex metacharacter for alternation. Your example regex of `r'00 LOG | Cycles Run: (.*?)'` looks for `00 LOG ` OR ` Cycles Run: (.*?)`, which is why it is not capturing the value. – dawg Aug 29 '18 at 20:23
  • Since you have `Max>` and `Current>` in the same line, how are you deciding which one is the target to capture? – dawg Aug 30 '18 at 14:11
  • We need to search for 4 different patterns. There are two lines with `Virtual: Max>` and `Current` each with `Virtual` and `Actual` respectively. We need to do different pattern matching to select the first and the last value as mentioned. – Foobar-naut Aug 30 '18 at 14:36
  • [This](https://regex101.com/r/L49460/1/) works for your example, but you are not clearly stating how to differentiate these values. – dawg Aug 30 '18 at 15:59
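
To illustrate the alternation point raised in the comments above, here is a minimal sketch comparing the question's pattern with one where the pipe is escaped. The `\s+` and `(\d+)` in the second pattern are my own substitutions, not something the commenters proposed; the sample line is copied from the question.

```python
import re

line = "00 LOG     |   Cycles Run:  120001\n"

# Unescaped: "|" splits the pattern into two alternatives. The first one
# ("00 LOG     ") contains no group, and the lazy (.*?) in the second one
# is satisfied by an empty match, so the number is never captured.
unescaped = re.findall(r'00 LOG     |   Cycles Run:  (.*?)', line)

# Escaped: "\|" is a literal pipe, \s+ tolerates variable spacing, and
# (\d+) greedily captures the whole run of digits.
escaped = re.findall(r'00 LOG\s+\|\s+Cycles Run:\s+(\d+)', line)

print(unescaped)
print(escaped)
```

Only the escaped pattern returns the cycle count; the original returns nothing but empty strings.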

3 Answers

2

With the regex

(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)

Demo

You can do something along these lines:

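import re

# Compile once; the capture group holds the digits after either label.
pat = re.compile(r'(?:(?:Cycles Run:[ \t]+)|(?:Current>[ \t]+))(\d+)')
with open('test.txt', 'r') as f:
    for line_num, line in enumerate(f):
        m = pat.search(line)
        if m:
            # group(0) is the whole match (label plus digits); group(1) is the number alone.
            print(line_num, m.group(0))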
dawg
  • I used group(0) just because it was unclear if the OP wanted to read that value into a dict or what. It identifies which value was captured. – dawg Aug 30 '18 at 00:25
  • Thanks a lot! Here you have matched the pattern using `Current>`. What if we have patterns like this: `00 LOG ! Virtual: Max> ?????????? bytes (?.???? gb), Current> 640733184 bytes (?.???? gb).` `00 LOG ! Virtual: Max> 1082470400 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).` `00 LOG ! Actual: Max> ????????? bytes (?.???? gb), Current> 472154112 bytes (?.???? gb).` `00 LOG ! Actual: Max> 861736960 bytes (?.???? gb), Current> ????????? bytes (?.???? gb).` We need to pick up these four values in four different variables. – Foobar-naut Aug 30 '18 at 05:58
  • Please update your question with those strings. It is unclear in a comment what you are looking for. – dawg Aug 30 '18 at 09:25
  • Updated the original question! :) – Foobar-naut Aug 30 '18 at 11:30
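
For the four memory values asked about in the comments above, one possible sketch follows. It rests on assumptions the thread never settles: that every `Virtual:`/`Actual:` line carries both a `Max>` and a `Current>` number, and that keeping the most recent value of each is acceptable. The names `parse_log`, `CYCLES` and `MEMORY` are hypothetical, and the `gb` figures in the sample lines are made-up placeholders for the `?` characters.

```python
import re

# Line shapes assumed from the samples in the question.
CYCLES = re.compile(r'Cycles Run:\s+(\d+)')
MEMORY = re.compile(
    r'(Virtual|Actual):\s+Max>\s+(\d+) bytes.*?Current>\s+(\d+) bytes'
)

def parse_log(lines):
    """Scan an iterable of lines and return the last value seen per field."""
    values = {}
    for line in lines:
        m = CYCLES.search(line)
        if m:
            values['cycle'] = int(m.group(1))
            continue
        m = MEMORY.search(line)
        if m:
            kind = m.group(1).lower()  # 'virtual' or 'actual'
            values[kind + '_max'] = int(m.group(2))
            values[kind + '_cur'] = int(m.group(3))
    return values

# Sample lines; the gb figures are placeholders, not real log output.
sample = [
    "00 LOG     |   Cycles Run:  120001\n",
    "00 LOG     ! Virtual: Max> 1082470400 bytes (1.0081 gb), Current> 640733184 bytes (0.5967 gb).\n",
    "00 LOG     ! Actual: Max> 861736960 bytes (0.8026 gb), Current> 472154112 bytes (0.4397 gb).\n",
]
result = parse_log(sample)
print(result)
```

Because `parse_log` accepts any iterable of lines, an open file object can be passed directly (`with open('test.txt') as f: values = parse_log(f)`), which keeps the line-by-line reading the question asks for. If the first and last occurrences have to be told apart rather than overwritten, the dictionary update is the place to change.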
1

You may use an alternation here with two lookbehinds:

(?<=Cycles Run:  )\d+|(?<= Current>  )\d+

Regex demo here.


Python example:

import re
text = '''
00 LOG     |   Cycles Run:  120001
00 LOG     !   Virtual: Max> 1082470400 bytes (1.0081 gb), Current>  640733184 bytes (0.5967 gb)
'''

pattern = re.compile(r'(?<=Cycles Run:  )\d+|(?<= Current>  )\d+')
matches = re.findall(pattern,text)
num_cycle = matches[0]
current = matches[1]

print(num_cycle,current)

Prints:

120001 640733184

As you are repeating the process in a loop, it is recommended to use re.compile to compile the pattern only once before the loop.
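
One caveat with the lookbehind form: Python's `re` module only accepts fixed-width lookbehinds, so the pattern above matches only when exactly two spaces precede the digits. The sketch below contrasts it with a capture-group variant that tolerates variable spacing. The capture-group pattern is my own illustration rather than part of this answer, and the one-space sample line is hypothetical.

```python
import re

lookbehind = re.compile(r'(?<=Cycles Run:  )\d+')  # exactly two spaces
group = re.compile(r'Cycles Run:\s+(\d+)')         # any run of whitespace

two_spaces = "00 LOG     |   Cycles Run:  120001"
one_space = "00 LOG     |   Cycles Run: 120001"    # hypothetical variant

print(lookbehind.findall(two_spaces), lookbehind.findall(one_space))
print(group.findall(two_spaces), group.findall(one_space))

# A variable-width lookbehind is rejected when the pattern is compiled.
try:
    re.compile(r'(?<=Cycles Run:\s+)\d+')
    variable_width_ok = True
except re.error:
    variable_width_ok = False
print(variable_width_ok)
```

If the spacing in the log is guaranteed to be constant, the lookbehind version is fine as written. The capture group is only a safer default when the spacing may vary.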

Paolo
0

Here we search each line for an identifier (such as `Cycles`) and apply a different regex depending on whether it is found:

import re
with open('test.txt','r') as f:
    for line in f:
        if re.search(r'Cycles',line):
            m=re.findall(r'\d+$',line)
        else:
            m=re.findall(r'Current>  (\d+)',line)
        print(m)
TheMaster
  • This works but seems rather inefficient. The patterns should at least be compiled before the loop. – Paolo Aug 29 '18 at 20:07
  • @Unbearable The doc states `the module-level matching functions are cached, so programs that use only a few regular expressions at a time needn’t worry about compiling regular expressions.` Since it's the same expression, I think caching will help. Also, the regex is really simple, without lookaheads/lookbehinds, which will save a lot of time. This is obviously not elegant. – TheMaster Aug 29 '18 at 20:19
  • You are right, it does appear the patterns are [cached](https://stackoverflow.com/a/452143/3390419), my bad. Yes, the lookarounds can be expensive however only one `re` function is needed if using them. – Paolo Aug 29 '18 at 20:23