Regex to match / output YAML documents in a logfile containing a specific string

Question

I have a logfile I'm tailing and want to output only those yaml documents (separated by ---) containing a specific string (specific domain in hostname).

Example logfile contents:

(focus on the hostname)

---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.different.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: 1.2.3.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"

expected output:

---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: a.b.c.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"
---
event: outgoing HTTP response
timestamp: 2021-10-06T08:15:28.212Z
remoteAddress: "1.2.3.4"
hostname: 1.2.3.domain.com
statusCode: 200
headers:
  content-length: 524
  etc: ...
body: "blabla (can be multiline and can contain anything)"

I cannot get my head around the regex I need. To match every document (regardless of what's inside), I'm using this:

/---\n[\s\S]+?(?=\n---|$)/g

see also: https://regex101.com/r/a8zKSz/2

However, I cannot figure out how to output only those documents whose hostname is in the domain domain.com (the regex for the inner match could be e.g. /hostname: .*?domain\.com/).

I would like to end up with a sed / perl or any other "oneliner" that works on a "default linux OS": tail -F logfile.log | oneliner. But getting the regex right is the first step.

Any hints or help is appreciated.

James · Accepted Answer · 2021-10-07T16:00:54.647

First of all, I have to say that regexes are not the proper tool for this. If your input is YAML, use a tool made specifically for YAML.

For example, using yq, this can be done very easily:

cat example | yq eval 'select(.hostname | test("[.]domain[.]com$"))' -

Equivalently, for JSON inputs, there is jq.

Regex solution

Still, this is an interesting challenge, and there might be cases where a regex is the most appropriate tool for the job. Here is a version that works.

Below, I wrote the pattern with added spacing and split it across five lines to make it easier to read.

---\n
( (?!---|hostname:) [^\n]+? (\n|$) )*
hostname:[^\n]+\.domain\.com (\n|$)
( (?!---|hostname:) [^\n]+? (\n|$) )*
(?=---|$)

The principle here is to write the pattern as an explicit state machine. A regex always describes a state machine, but we tend not to think about it; here, we want to make it very obvious.

In the initial state (state #1), we look for a "yaml document start" marker (that is, ---\n). When we find such a line, we move to state #2.
In state #2, we capture input lines (exactly one line at a time). However, we refuse to capture a line that starts with hostname: (which forces a transition to state #3) or a line that starts with --- (which forces the engine to backtrack to state #1).
In state #3, we capture a single line starting with hostname:, but only if the rest of the line matches the expected domain. If such a line is captured, we jump to state #4. If we can't match the line, the engine can't continue (because of the negative lookahead in state #2) and will therefore backtrack to state #1.
In state #4, we continue capturing input lines until we reach the end of that document (that is, until the next line starts with --- or the input ends).
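
As a rough illustration of the four states above, here is a minimal Python sketch of the same pattern in verbose mode. The sample text is shortened from the question. The `^` anchor, the `\Z` end-of-input assertions and the non-capturing groups are adjustments made here for Python's `re` module, not part of the pattern as written above, and the dots in the domain are escaped.

```python
import re

# Shortened sample based on the question (fewer keys per document).
SAMPLE = """\
---
event: outgoing HTTP response
hostname: a.b.c.domain.com
statusCode: 200
---
event: outgoing HTTP response
hostname: a.b.c.different.com
statusCode: 200
---
event: outgoing HTTP response
hostname: 1.2.3.domain.com
statusCode: 200
"""

# re.VERBOSE ignores whitespace and allows comments inside the pattern.
DOC_WITH_DOMAIN = re.compile(
    r"""
    ^---\n                                      # state 1: document start marker
    (?: (?!---|hostname:) [^\n]+ (?:\n|\Z) )*   # state 2: lines before hostname
    hostname:[^\n]+\.domain\.com (?:\n|\Z)      # state 3: the hostname line
    (?: (?!---|hostname:) [^\n]+ (?:\n|\Z) )*   # state 4: rest of the document
    (?= --- | \Z )                              # stop at next document or end
    """,
    re.VERBOSE | re.MULTILINE,
)

# Each match is one complete document, including its leading '---' line.
matches = [m.group(0) for m in DOC_WITH_DOMAIN.finditer(SAMPLE)]
for document in matches:
    print(document, end="")
```

One limitation carried over from the pattern: `[^\n]+` needs at least one character per line, so a document containing an empty line would not match as written.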

Perl solution

Given that neither the yq solution nor the regex solution is viable in your situation, here is yet another approach, this time using perl (no external module required).

Once again, I format the code so that it is easier to understand, but this can easily be reduced to a single line.

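perl -ne '
    if ($_ =~ /^---$/) {
        # Start of a new document: reset the state and start buffering.
        $match = 0;
        $doc = $_;
    } elsif ($match) {
        # The hostname already matched: print the rest of the document as it arrives.
        print($_);
    } else {
        # Still undecided: keep buffering until the hostname line shows up.
        $doc .= $_;
        if ($_ =~ /^hostname: [^\n]*\.domain\.com$/) {
            print($doc);
            $match = 1;
        }
    }
'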
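
For readers who prefer Python, the same line-by-line approach might look like the sketch below. It is not a drop-in replacement for the perl one-liner; the function name `filter_documents` and the sample lines are made up for illustration, and the hostname check mirrors the regex used in the perl code.

```python
import re
from typing import Iterable, Iterator

# Same check as the perl version: a hostname line ending in .domain.com.
HOSTNAME_LINE = re.compile(r"^hostname: .*\.domain\.com$")


def filter_documents(lines: Iterable[str]) -> Iterator[str]:
    """Yield the lines of each '---'-separated document whose hostname matches."""
    buffered = []    # lines of the current document seen before a decision
    matched = False  # True once the current document's hostname has matched
    for line in lines:
        stripped = line.rstrip("\n")
        if stripped == "---":
            # New document: forget the previous buffer and start over.
            buffered = [line]
            matched = False
        elif matched:
            # Already decided to keep this document: pass lines straight through.
            yield line
        else:
            buffered.append(line)
            if HOSTNAME_LINE.match(stripped):
                # Emit everything buffered so far, then switch to pass-through.
                yield from buffered
                buffered = []
                matched = True


# Hypothetical sample input, shortened from the question.
sample_lines = [
    "---\n", "event: outgoing HTTP response\n",
    "hostname: a.b.c.domain.com\n", "statusCode: 200\n",
    "---\n", "event: outgoing HTTP response\n",
    "hostname: a.b.c.different.com\n", "statusCode: 200\n",
    "---\n", "event: outgoing HTTP response\n",
    "hostname: 1.2.3.domain.com\n", "statusCode: 200\n",
]
kept = list(filter_documents(sample_lines))
print("".join(kept), end="")
```

To use it on a stream, one could pass `sys.stdin` instead of `sample_lines` and flush `sys.stdout` after each written line, so that output appears as soon as a hostname line matches rather than when the next document starts.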

Hey @jwatkins - thx a lot. The yq version is indeed very simple and elegant. And getting that answer from the repo owner makes it even nicer ;-) This I could test in my dev environment but cannot use in prod. I'm still trying to make the regex work on a tail -F with grep. I will accept your answer once tested successfully tomorrow. Thx again - also very nice explanation for the regex. — Simon, Oct 06 '21 at 17:55
I doubt it will work using `tail -F ... | grep ...`. By default, `grep` works one line at a time, so it is not possible to perform multiline matches. This behaviour can be changed in GNU `grep` using the `-zo` arguments (see [here](https://stackoverflow.com/questions/3717772/regex-grep-for-multi-line-search-needed/7167115#7167115) for explanations), but enabling this option makes `grep` wait for stdin to close before passing the input to the regex library. That means you can't process streaming content using this strategy. — James, Oct 06 '21 at 19:10
The yq version works with stream (`tail -F`) but does not output the last document. When `cat file.log` I get everything but order is reversed. Using the regex with grep `cat logtest.log | grep -Pzo -- '---\n((?!---|hostname:)[^\n]+?(\n|$))+hostname:[^\n]+.domain\.com(\n|$)((?!---|hostname:)[^\n]+?(\n|$))+(?=---|$)'` works with a static file but does not with a stream (as you wrote in the comment). The Perl version works with stream/tail and cat but does also not output the last document. I have a modified perl version in my next comment which corrects this. — Simon, Oct 07 '21 at 12:01
`tail -F logtest.log | perl -ne ' $doc .= $_; if ($_ =~ /^---$/) { $match = 0; $doc = $_; } elsif ($_ =~ /^hostname: [^\n]*\.domain\.com$/) { $match = 1; } if ($match) { print($doc); $doc = ""; }'` — Simon, Oct 07 '21 at 12:01
You are right... assuming a streaming input, my initial perl suggestion would block until a new document comes in before outputting a matching document. I updated the code in my answer. Thanks! — James, Oct 07 '21 at 16:05

score 0 · Answer 2 · answered Oct 06 '21 at 20:29

A solution without regex, in Python. Assume your text is in test.log:

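with open('test.log') as f:
    # Split on the document separator; the leading '---' yields an empty first chunk.
    documents = f.read().split('---\n')
for document in documents:
    # Look for the hostname line instead of relying on a fixed line position.
    hostnames = [line for line in document.splitlines() if line.startswith('hostname:')]
    if hostnames and hostnames[0].strip().endswith('.domain.com'):
        # Re-add the separator that split() removed.
        print('---\n' + document, end='')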

Rafiqul Islam
