
I have the following XML. I need to extract the IP address, protocol, and port into a CSV file with the corresponding column names.

<rule family="ipv4">
<source address="10.XXX.XX.XX"/>
<port protocol="tcp" port="22"/>
<log prefix="ber_" level="warning">
<limit value="1/m"/>
</log>
<accept/>
</rule>
<rule family="ipv4">
<source address="10.XXX.XX.XXX"/>
<port protocol="udp" port="1025"/>
<log prefix="ber_" level="warning">
<limit value="1/m"/>
</log>
<accept/>

I'm able to extract the IP address or the port using grep or sed, like this: `grep -Eo "([0-9]{1,3}[\.]){3}[0-9]{1,3}"`. But I need them as columns in a CSV file.

The columns should be IPAddress, Protocol, Port. What is the best way to achieve this?

steve
  • Get values using XPath. On Linux, you can use xmllint for that. – LMC Mar 14 '18 at 20:25
  • You cannot generally parse XML with regular expressions, as has been discussed hundreds of times on this site. If you want robust code that will not fail when the input format changes slightly, you must use an XML parser to load the DOM, or an XPath evaluator. Both Perl and Python have everything you will need. – Jim Garrison Mar 14 '18 at 20:26
  • Please, provide **REAL** xml – Gilles Quénot Mar 14 '18 at 20:32
  • Possible duplicate of [get attribute value using xmlstarlet or xmllint](https://stackoverflow.com/questions/48595262/get-attribute-value-using-xmlstarlet-or-xmllint) – kvantour Mar 14 '18 at 22:00

2 Answers


Don't use regex to parse HTML/XML; use a real parser instead (here, xmlstarlet):

Corrected input XML file (the original was not well-formed):

<root>
    <rule family="ipv4">
        <source address="10.XXX.XX.XX"/>
        <port protocol="tcp" port="22"/>
        <log prefix="ber_" level="warning">
            <limit value="1/m"/>
        </log>
    </rule>
    <rule family="ipv4">
        <source address="10.XXX.XX.XXX"/>
        <port protocol="udp" port="1025"/>
        <log prefix="ber_" level="warning">
            <limit value="1/m"/>
        </log>
    </rule>
</root>

Code:

xmlstarlet sel -t -v '//source/@address | //port/@protocol | //port/@port' file |
perl -pe '$. % 3 != 0 && s/\n/,/g;END{print "\n"}'

Output:

10.XXX.XX.XX,tcp,22
10.XXX.XX.XXX,udp,1025
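
If Python is available, the same extraction can be sketched with the standard library alone, which also makes it easy to add the header row the question asks for. This is a minimal sketch, not the method used above. It assumes the rules are wrapped in a single root element, as in the corrected input. The function name `rules_to_csv` is hypothetical, and the column names are taken from the question.

```python
import csv
import io
import xml.etree.ElementTree as ET


def rules_to_csv(xml_text):
    """Return CSV text with one row per <rule> that has both <source> and <port>.

    Hypothetical helper: assumes well-formed XML with a single root element.
    """
    root = ET.fromstring(xml_text)
    out = io.StringIO()
    writer = csv.writer(out, lineterminator="\n")
    # Header row, using the column names given in the question.
    writer.writerow(["IPAddress", "Protocol", "Port"])
    for rule in root.iter("rule"):
        source = rule.find("source")
        port = rule.find("port")
        if source is None or port is None:
            continue  # skip rules that lack either element
        writer.writerow(
            [source.get("address"), port.get("protocol"), port.get("port")]
        )
    return out.getvalue()


# Sample input mirroring the corrected XML above (trimmed to the relevant tags).
SAMPLE = """<root>
  <rule family="ipv4">
    <source address="10.XXX.XX.XX"/>
    <port protocol="tcp" port="22"/>
  </rule>
  <rule family="ipv4">
    <source address="10.XXX.XX.XXX"/>
    <port protocol="udp" port="1025"/>
  </rule>
</root>"""

if __name__ == "__main__":
    print(rules_to_csv(SAMPLE), end="")
```

Reading each rule as a unit, rather than flattening all attributes into one stream, means a rule with a missing `<port>` cannot shift the columns of the rows after it.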

Theory:

According to compiler theory, HTML and XML cannot be parsed with regular expressions, which are equivalent to finite-state machines. Because the markup is nested hierarchically, you need a pushdown automaton, for example a parser generated from an LALR grammar with a tool like YACC.
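
The nesting argument can be illustrated with a small sketch. Recognizing individual tags is a job a regular expression can do, but checking that arbitrarily deep tags are properly paired needs a stack, which is the memory a pushdown automaton adds. The helper `tags_balanced` below is hypothetical and deliberately a toy, not an XML parser: it assumes no comments, CDATA sections, or `>` characters inside attribute values.

```python
import re

# Tokenizing single tags is a regular task; pairing them up requires a stack.
TAG = re.compile(r"<(/?)([A-Za-z_][\w.-]*)[^>]*?(/?)>")


def tags_balanced(text):
    """Return True if every opening tag is closed in the right order."""
    stack = []
    for closing, name, self_closing in TAG.findall(text):
        if self_closing:
            continue  # <tag/> opens and closes itself
        if closing:
            # A closing tag must match the most recently opened tag.
            if not stack or stack.pop() != name:
                return False
        else:
            stack.append(name)
    return not stack  # anything left on the stack was never closed
```

For example, the input in the question is missing its final `</rule>`, so a check like this would reject it, while the corrected input passes.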

realLife©®™ everyday tools in a shell:

You can use one of the following:

  • xmllint
  • xmlstarlet
  • saxon-lint (my own project)

Check: Using regular expressions with HTML tags

Gilles Quénot

Lacking XML tools, here is a fragile awk solution (it prints space-separated values rather than CSV):

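$ awk -v RS='</rule>' '
       # One record per rule: a multi-character RS is not POSIX, but works in gawk.
       # Take the quoted value of each matching attribute; end the line after every third value.
       {for(i=1;i<=NF;i++) 
          if($i~/^(address|protocol|port)/) 
            {split($i,a,"\""); printf "%s", a[2] (++c%3?FS:ORS)}}' file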

10.XXX.XX.XX tcp 22
10.XXX.XX.XXX udp 1025
karakfa