Extract specific XML pattern from log file using 'awk'

Question

I would like to extract from a log file that contains mostly Java log data (debug/errors/info) the following XML:

<envelope>
    <header>
        ...
    </header>
    <body>
        <Provision>
            <ORDER id="XYZ_123_456" action="test">
                ....
            </ORDER>
        </Provision>
    </body>
</envelope>

I only need the envelope that has the "Provision" tag and contains the ORDER id XYZ_123_456.

I've tried the following, but it also returns XML blocks without the Provision tag. (I'm nearly clueless in awk; this is code I adapted for this particular need.)

awk '/<envelope>/ {line=$0; p=0 && x=0; next}
     line   {line=line ORS $0}
    /ORDER/ && $2~/XYZ_123_456/ {p=1}
    $0~/<Provision>/ {x=1}
   /<\/envelope>/ && p && x {print line;}' dump.file

Thanks!

What have you tried to resolve this issue yourself? Can you please share your current code? — Johan, Oct 06 '18 at 20:49
[You can't parse \[X\]HTML with regex](http://stackoverflow.com/a/1732454/3776858). I suggest to use an XML/HTML parser (xmlstarlet, e.g.). — Cyrus, Oct 06 '18 at 20:52
@Cyrus I don't need to parse, I need to extract an XML which is always in this pattern. The only variable is the order id. Plus this log file contains hundreds of thousands of lines which are mostly Java logs, wouldn't be easy to parse that. — Dunams, Oct 06 '18 at 20:53
You're doing this in the most error-prone way possible. You can use an XML parser like zorba (probably already installed if you have awk), and just tell it what you want. `(//envelope/body/Provision/ORDER/@id)` will extract the Order ID. — Terry Carmen, Oct 06 '18 at 21:07
@TerryCarmen I don't need to extract the order ID, I already have the order ID. I need to extract the entire block of XML containing that ID. — Dunams, Oct 06 '18 at 21:12
If your "logfile" contains other, non-XML data, you should show this. Otherwise, an XML parser with XPath expressions is a much saner approach than using regular expressions to extract nodes. — Corion, Oct 06 '18 at 22:23
Possible duplicate of [Extraction of data from a simple XML file](https://stackoverflow.com/q/2222150/608639), [Extract xml tag value using awk command](https://stackoverflow.com/q/14054203/608639), [Use awk to extract value from a line](https://stackoverflow.com/q/25175047/608639), etc. — jww, Oct 07 '18 at 01:11
You should include some of the surrounding non-XML text in your sample input so people don't keep advising you to use an XML parser, and then also show the expected output (your current sample input) to complete the [mcve]. — Ed Morton, Oct 07 '18 at 12:05

steffen · Answer 1 · 2018-10-07T18:08:33.517

You shouldn't parse XML with awk. Better to use xmlstarlet. Assuming file.xml contains only well-formed XML, this will print the whole envelope:

$ apt-get install xmlstarlet
$ xmlstarlet sel -t -c '/envelope/body/Provision/ORDER[@id="XYZ_123_456"]/../../..' file.xml

For awk, I propose this:

awk '
    !el&&/<envelope>/{el=1}
    el==1&&/<body>/{el=2}
    el==2&&/<Provision>/{el=3}
    el==3&&/<ORDER.*id="XYZ_123_456"/{el=4;hit=1}
    el>0{buffer=buffer $0 ORS}
    el==4&&/<\/ORDER>/{el=3}
    el==3&&/<\/Provision>/{el=2}
    el==2&&/<\/body>/{el=1}
    # reset buffer and hit after every envelope, not only after a match
    el==1&&/<\/envelope>/{el=0; if (hit) print buffer; buffer=""; hit=0}
' file.xml

This checks for the correct XML structure and prints the whole envelope, provided each XML element is on its own line.

Corion · Answer 2 · 2018-10-07T12:19:02.397

If your XML or logfile is as well-formed as you claim, you can (ab)use awk and its RS record separator feature to do most of the parsing for you:

 awk 'BEGIN{ RS="</envelope>"; FS="<envelope>" }; $0 ~ order { print "<envelope>",$2,"</envelope>" }' order=XYZ_123_456 tmp.txt

This works by defining </envelope> as the awk record separator, so each record holds everything up to the next </envelope> string. To then strip the other log messages, I use the FS field separator to split each record into fields at <envelope>, and output the second field.

This will fail horribly if any <envelope> or </envelope> string happens to appear anywhere else in your log data, but you've already been pointed towards better XML parsers.

As the awk solution requires GNU awk for multi-char RS, here is the same approach using perl in case no suitable awk version is available:

 perl -ne 'BEGIN{ $/="</envelope>";$order=shift }; /<envelope>.*$order.*/ms and print $&' XYZ_123_456 tmp.txt
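
Both one-liners above select on the order id alone, while the question also asks for the Provision tag. Here is a minimal sketch of one way to add that check, under some assumptions: the file name rs_demo.log and its log lines are invented for illustration, the awk in use accepts a multi-character RS (as GNU awk does), and the id is compared as a literal string with index() rather than as a regex.

```shell
# Hypothetical input: Java-style log lines around two envelopes that carry
# the same ORDER id; only the first one is wrapped in <Provision>.
printf '%s\n' \
  '2018-10-06 20:40:01 INFO com.example.App - sending' \
  '<envelope><body><Provision>' \
  '<ORDER id="XYZ_123_456" action="test"></ORDER>' \
  '</Provision></body></envelope>' \
  '2018-10-06 20:40:02 INFO com.example.App - sending' \
  '<envelope><body><Cancel>' \
  '<ORDER id="XYZ_123_456" action="test"></ORDER>' \
  '</Cancel></body></envelope>' > rs_demo.log

# Records end at </envelope>; field 2 is whatever follows <envelope>.
# NF > 1 skips trailing text that contains no <envelope> at all.
matches=$(awk -v order='XYZ_123_456' '
BEGIN { RS = "</envelope>"; FS = "<envelope>" }
NF > 1 && index($2, "<Provision>") && index($2, "id=\"" order "\"") {
    printf "<envelope>%s</envelope>\n", $2
}' rs_demo.log)

printf '%s\n' "$matches"
```

Run against this sample, only the first envelope is printed; the one wrapped in <Cancel> is skipped even though it carries the same order id. The caveat about stray <envelope> strings in the log applies here as well.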

You should mention that requires GNU awk for multi-char RS. — Ed Morton, Oct 07 '18 at 12:03

Ed Morton · Answer 3 · 2018-10-07T12:08:21.010

$ cat tst.awk
/<envelope>/ { inEnv = 1 }
inEnv { env = env $0 ORS }
/<\/envelope>/ {
    if ( env ~ /<Provision>.*<ORDER[[:space:]]+id="XYZ_123_456"/ ) {
        printf "%s", env
    }
    env = inEnv = ""
}

$ awk -f tst.awk file
<envelope>
    <header>
        ...
    </header>
    <body>
        <Provision>
            <ORDER id="XYZ_123_456" action="test">
                ....
            </ORDER>
        </Provision>
    </body>
</envelope>
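
To try a script like this on input that looks more like the real log, a small mixed file helps. The following is only a sketch under stated assumptions: the Java-style log lines, the file name sample.log and the awk variable name id are made up for illustration. It passes the order id with -v instead of hard-coding it, and it uses index() for literal matching, which assumes the id attribute follows `<ORDER` after exactly one space, as in the sample.

```shell
# Hypothetical log: Java-style lines around two envelopes.
# Only the first envelope has <Provision> and the wanted ORDER id.
cat > sample.log <<'EOF'
2018-10-06 20:40:01 INFO  [main] com.example.App - request received
<envelope>
    <body>
        <Provision>
            <ORDER id="XYZ_123_456" action="test">
            </ORDER>
        </Provision>
    </body>
</envelope>
2018-10-06 20:40:02 DEBUG [main] com.example.App - next request
<envelope>
    <body>
        <Cancel>
            <ORDER id="XYZ_123_456" action="test">
            </ORDER>
        </Cancel>
    </body>
</envelope>
2018-10-06 20:40:03 ERROR [main] com.example.App - done
EOF

# Collect each envelope; print it only when <Provision> is present and is
# followed by an ORDER tag carrying exactly the requested id.
result=$(awk -v id='XYZ_123_456' '
/<envelope>/ { inEnv = 1 }
inEnv        { env = env $0 ORS }
/<\/envelope>/ {
    p = index(env, "<Provision>")
    if (p && index(substr(env, p), "<ORDER id=\"" id "\""))
        printf "%s", env
    env = ""; inEnv = 0
}' sample.log)

printf '%s\n' "$result"
```

With this sample, only the first envelope is printed; the surrounding log lines and the <Cancel> envelope are dropped.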
