How do I match multiline expressions with junk in the middle?

Question

I'm trying to match a multiline expression from some logs we have. The biggest problem is that, because of race conditions, we sometimes have to use a custom print function with a mutex, and sometimes (when that's not necessary) we just use printf. This results in two types of logs.

My solution was this monstrosity:

changed key '(\w+)' value: <((([0-9a-f]{2} *)+)(?:\n)*(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)*)*>

Explanation of the above regex:

changed key '(\w+)' value: - This is how we detect a print (and save the keyname in a capture group).
<{regex}> - The value output starts with < and ends with >
([0-9a-f]{2} *) - The bytes are hexadecimal pairs followed by zero or more spaces (because the last byte doesn't have a trailing space). Let's call this capture group 4.
({group4}+) - One or more of group 4.
(?:\n)* - There can be 0 or more newlines after this "XX " pair. (non-capture)
(?:<\d+> \w+ (?:.*?] \[\d+\])\s*)* - There can be 0 or more prints of the timestamp. (non-capture)

This works for the Case 2 logs, but not for the Case 1 logs. In Case 1, for some reason only the last line is matched.
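
The "only the last line" symptom is consistent with how Python's re module treats a capture group that sits inside a repeated group: each repetition overwrites the previous capture, so only the final one survives. A minimal sketch of that behaviour, using a made-up sample string rather than the real logs:

```python
import re

# Hypothetical sample: three hex pairs between < and >.
sample = "<aa bb cc>"

# Group 1 sits inside a repeated group, so each repetition overwrites it.
m = re.search(r"<(?:([0-9a-f]{2}) ?)+>", sample)
last_only = m.group(1)

# Capturing the whole span once and splitting afterwards keeps every pair.
m2 = re.search(r"<((?:[0-9a-f]{2} ?)+)>", sample)
all_pairs = m2.group(1).split()

print(last_only)  # cc
print(all_pairs)  # ['aa', 'bb', 'cc']
```

In the regex above, group 2 is itself repeated with `*`, so the same overwrite would apply to it whenever the value spans more than one repetition.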

Essentially, I'm trying to match this (two capture groups):

changed key '(\w+)' value: <({only hexadecimal pairs})>

group 1: key
group 2: value

Below are the dummy cases (same value in all cases):

// Case 1
<22213> Nov 30 00:00:00.287 [D1]  [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]
<22213> Nov 30 00:00:00.287 [D1]  [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]ff ff
<22213> Nov 30 00:00:00.287 [D1]  [128]00 00 00 11 00 00 00 00 00 21>

// Case 2
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21>

// Case 2 with some newlines in the middle
changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00 00 

04 00 00 ff  
ff 00 00 00 11 00 

00 00 00 00 21>

The key isn't always the same key, so the value (and the value length) can change.

This is another case of [matching repeated capturing groups](https://stackoverflow.com/q/9764930/3832970). If you need to access the captures, you need to use PyPi regex library. See [this demo](https://regex101.com/r/uBncdX/1), do you want it like this? — Wiktor Stribiżew, Jul 28 '22 at 08:00
It would help your question to also include the text you want to extract based on the sample input you gave above. — Tim Biegeleisen, Jul 28 '22 at 08:02
@TimBiegeleisen "Essentially, I'm trying to match this". I did specify. I want to capture two groups: key, and value (which is all the hex pairs between < and >). — Alex Osheter, Jul 28 '22 at 08:06
[It works for me](https://tio.run/##tVJdT8IwFH3vr7jpC@0Gy9aRiBNHTCQ@GHk322LmaGHKPtINHDH@9tkNgijEgIntSZu259x7enPzdTnPUruu4yTPZAmSz3iFSl6VBVyDhzEeMsYs24VJtgLbBNN0WhhscAHerRUAeBYbBNE8TGd8Cq98DZ378ePT5OZh3IFVuFhyB4bhM@xgmhugE0OfymuC9s8LLQQIcU78Fpa1@8MWzHJVqbpILX8oBOw5h9YS/GsuhPazATrMpxgHGQOEZKWaom0RI8qSPF5wIr@bIP6bTr8cEEI8s3cZ9kTwzj5AozolI8dPqaa2oT/VXVACUAdDG6k6@5668wPqFxpV08UUiUxC048Qp@1eOAjUSBonlVHwUEZz0jzQ9j4WkGwYzchlnJYET8JEucGgQ2LMZLbMiUXpT9Id21DwFWDjJYtTkhhRmJdLyQvC6BGBvRV0jwjsY4L@L4K@EtT1Jw). — Wiktor Stribiżew, Jul 28 '22 at 08:21

Tim Biegeleisen · Accepted Answer · 2022-07-28T08:49:53.770

This approach starts by first stripping out the leading log content of each line, leaving behind the content you want to target. After that, it does an re.findall search using a regex pattern similar to the one you are already using.

import re

inp = """<22213> Nov 30 00:00:00.287 [D1]  [128]changed key 'KEY_NAME' value: <ab ab ab ab 00 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]
<22213> Nov 30 00:00:00.287 [D1]  [128]00 04 00 00
<22213> Nov 30 00:00:00.287 [D1]  [128]ff ff
<22213> Nov 30 00:00:00.287 [D1]  [128]00 00 00 11 00 00 00 00 00 21>"""
inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], re.sub(r'\s+', ' ', x[1])) for x in matches]
print(matches)

This prints:

[('KEY_NAME', 'ab ab ab ab 00 00 00 00 04 00 00 ff ff 00 00 00 11 00 00 00 00 00 21')]

Assuming there could be unwanted values in between 'KEY_NAME' value: < and the closing >, we can use re.findall on the second group to match all hexadecimal values:

inp = re.sub(r'^<.*?>.*?(?:\s+\[.*?\])+', '', inp, flags=re.M)
matches = re.findall(r"changed key '(\w+)' value: <(.*?)>", inp, flags=re.S)
matches = [(x[0], ' '.join(re.findall(r'\b[a-f0-9]{2}\b', x[1]))) for x in matches]
print(matches)  # output same as above
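
The two steps above can be wrapped into one reusable function. This is only a sketch: the name `extract_changes` and the module-level pattern names are hypothetical, and it assumes every log prefix has the same `<id> timestamp [..] [..]` shape as the Case 1 sample.

```python
import re

# Same patterns as in the answer, compiled once for reuse.
PREFIX = re.compile(r'^<.*?>.*?(?:\s+\[.*?\])+', flags=re.M)
ENTRY = re.compile(r"changed key '(\w+)' value: <(.*?)>", flags=re.S)
HEX_PAIR = re.compile(r'\b[a-f0-9]{2}\b')


def extract_changes(log_text):
    """Return a list of (key, value) tuples (hypothetical helper)."""
    # Step 1: drop the leading "<id> timestamp [..] [..]" part of each line.
    stripped = PREFIX.sub('', log_text)
    # Step 2: find each entry, then keep only standalone hex pairs in its value.
    return [(key, ' '.join(HEX_PAIR.findall(raw)))
            for key, raw in ENTRY.findall(stripped)]


# Junk in the middle of a value is skipped.
sample = "changed key 'KEY_NAME' value: <aa bb jasldkjasd 00 aa>"
print(extract_changes(sample))  # [('KEY_NAME', 'aa bb 00 aa')]
```

One limitation to keep in mind: a junk token that happens to be exactly two hex characters (for example a stray "be") would still be kept, since it is indistinguishable from a byte.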

edited Jul 28 '22 at 08:49

answered Jul 28 '22 at 08:15

Tim Biegeleisen


This approach is interesting, but could use some work, because there's a race-condition issue, as I explained in the post. You could theoretically have a value like "aa bb jasldkjasd 00 aa" and you wouldn't want to match the gibberish. – Alex Osheter Jul 28 '22 at 08:40
@AlexOsheter That's easy to fix. Just run `re.findall(r'\b[a-f0-9]{2}\b', x)`, where `x` is each second group from each match in my code above. That would leave you with a list containing only the hexadecimal values. – Tim Biegeleisen Jul 28 '22 at 08:41
To be honest, it's a little more regex than I was expecting, because we have thousands of lines in logs and we're essentially doing 4 regex operations per line. But hey, if it works, it works. – Alex Osheter Jul 28 '22 at 08:45
Check the second part of my updated answer, which now does a find all for hexadecimal values in the second capture group. – Tim Biegeleisen Jul 28 '22 at 08:50
