Extract data from XML with sed, grep, or awk

Question

I have the following XML. I need to extract the IP address, protocol, and port into a CSV file with the corresponding column names.

<rule family="ipv4">
<source address="10.XXX.XX.XX"/>
<port protocol="tcp" port="22"/>
<log prefix="ber_" level="warning">
<limit value="1/m"/>
</log>
<accept/>
</rule>
<rule family="ipv4">
<source address="10.XXX.XX.XXX"/>
<port protocol="udp" port="1025"/>
<log prefix="ber_" level="warning">
<limit value="1/m"/>
</log>
<accept/>

I'm able to extract the IP address or the port using grep or sed, like this: grep -Eo "([0-9]{1,3}[\.]){3}[0-9]{1,3}". But I need the values as columns in a CSV file.

The columns should be: IPAddress, Protocol, Port. What is the best way to achieve this?

Get values using XPath. On Linux, you can use xmllint for that. — LMC, Mar 14 '18 at 20:25
You cannot generally parse XML with regular expressions, as has been discussed hundreds of times on this site. If you want robust code that will not fail when the input format changes slightly you must use an XML parser to load the DOM, or an XPath evaluator. Either Perl or Python have everything you will need. — Jim Garrison, Mar 14 '18 at 20:26
Possible duplicate of [get attribute value using xmlstarlet or xmllint](https://stackoverflow.com/questions/48595262/get-attribute-value-using-xmlstarlet-or-xmllint) — kvantour, Mar 14 '18 at 22:00

Gilles Quénot · Answer 1 · 2018-03-22T00:16:08.090

Don't use regex to parse HTML/XML; use a real parser (with XPath):

Corrected input XML file (the original had no root element and was missing the final </rule>):

<root>
    <rule family="ipv4">
        <source address="10.XXX.XX.XX"/>
        <port protocol="tcp" port="22"/>
        <log prefix="ber_" level="warning">
            <limit value="1/m"/>
        </log>
    </rule>
    <rule family="ipv4">
        <source address="10.XXX.XX.XXX"/>
        <port protocol="udp" port="1025"/>
        <log prefix="ber_" level="warning">
            <limit value="1/m"/>
        </log>
    </rule>
</root>

Code:

xmlstarlet sel -t -v '//source/@address | //port/@protocol | //port/@port' file |
perl -pe '$. % 3 != 0 && s/\n/,/g;END{print "\n"}'

Output:

10.XXX.XX.XX,tcp,22
10.XXX.XX.XXX,udp,1025
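
For the CSV with column names that the question asks for, the same per-rule extraction can be sketched with Python's standard library as an alternative to the xmlstarlet pipeline above. This is only a minimal sketch, assuming the corrected input with a single <root> element. The header names IPAddress, Protocol, Port come from the question; the helper name rules_to_csv is hypothetical.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Trimmed copy of the corrected sample input (a single root element is assumed).
SAMPLE = """<root>
    <rule family="ipv4">
        <source address="10.XXX.XX.XX"/>
        <port protocol="tcp" port="22"/>
    </rule>
    <rule family="ipv4">
        <source address="10.XXX.XX.XXX"/>
        <port protocol="udp" port="1025"/>
    </rule>
</root>"""


def rules_to_csv(xml_text):
    """Return CSV text with a header and one row per <rule> (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    writer.writerow(["IPAddress", "Protocol", "Port"])
    for rule in root.iter("rule"):
        source = rule.find("source")
        port = rule.find("port")
        # Skip rules that lack either element instead of guessing values.
        if source is None or port is None:
            continue
        writer.writerow([source.get("address"), port.get("protocol"), port.get("port")])
    return out.getvalue()


if __name__ == "__main__":
    print(rules_to_csv(SAMPLE), end="")
```

Reading the values rule by rule keeps each address paired with its own protocol and port, rather than relying on every rule contributing exactly three lines of output.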

Theory:

According to compiler theory, HTML cannot be parsed with regular expressions, which are based on finite state machines. Because HTML is hierarchical (elements nest to arbitrary depth), you need a pushdown automaton, for example an LALR grammar handled by a tool like YACC.

realLife©®™ everyday tools in a shell:

You can use one of the following:

xmllint

xmlstarlet

saxon-lint (my own project)

Check: Using regular expressions with HTML tags

Instead of this, how can I get in columns? — steve, Mar 20 '18 at 22:50
Don't know what you mean — Gilles Quénot, Mar 20 '18 at 23:43
The output is 10.XXX.XX.XX tcp 22 — steve, Mar 21 '18 at 23:56

score 0 · Answer 2 · answered Mar 14 '18 at 20:41

Lacking XML tools, here is a fragile awk solution:

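$ awk -v RS='</rule>' '
       # Each <rule> is one record (multi-character RS needs gawk or mawk).
       # Print the quoted value of every address/protocol/port attribute,
       # three values per output line.
       {for(i=1;i<=NF;i++)
          if($i~/^(address|protocol|port)/)
            {split($i,a,"\""); printf "%s", a[2] (++c%3?FS:ORS)}}' file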

10.XXX.XX.XX tcp 22
10.XXX.XX.XXX udp 1025
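
To make "fragile" concrete: the awk script relies on double-quoted values, on no spaces around the = sign, and on the attributes appearing in the order protocol, then port. A parser has none of these dependencies. The sketch below uses Python's standard xml.etree.ElementTree on a hypothetical variant of one rule that uses single quotes, spaces around =, and reordered elements and attributes. The variant markup and the helper name extract are illustrative assumptions, not part of the question's data.

```python
import xml.etree.ElementTree as ET

# Hypothetical variant of one rule: same data, different but equivalent markup.
VARIANT = """<root>
  <rule family='ipv4'>
    <port port = '22' protocol = 'tcp'/>
    <source address='10.XXX.XX.XX'/>
  </rule>
</root>"""


def extract(xml_text):
    """Return (address, protocol, port) tuples, one per <rule>."""
    rows = []
    for rule in ET.fromstring(xml_text).iter("rule"):
        source, port = rule.find("source"), rule.find("port")
        # Attributes are looked up by name, so their order in the file is irrelevant.
        rows.append((source.get("address"), port.get("protocol"), port.get("port")))
    return rows


if __name__ == "__main__":
    print(extract(VARIANT))
```

The values come back in the requested column order regardless of how the markup is laid out, which is the robustness the comments under the question are asking for.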