extract xml tag and its value

Question

I want to read an XML file and store its values in variables. For example:

qhr2400.xml

<XML>
<OPERATION type="1">
<TABLENAME>TABLE</TABLENAME>
<ROWSET>
<ROW>
<CLLI>518</CLLI>
<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
<SS7RT>99</SS7RT>
<AQPRT_1>84</AQPRT_1>
<L7RMSUOCT_01>80</L7RMSUOCT_01>
<L7RMSUOCT_02>80</L7RMSUOCT_02>
</ROW>
</ROWSET>
</OPERATION>
</XML>

I want its values in variables like $CLLI = 518, $COLLECTION_DATE = 06/04/20 00:45:00, $SS7RT = 99, so that I can use these values later to write an insert query.

Basically I want to load this .xml data into a database table.

This is what I tried:

read_xml.sh

awk 'NF==1 && (/ +<[a-zA-Z]+>/ || /^<[a-zA-Z]+>/ || / +<\/[a-zA-Z]+>/){ 
next 
} 
{ 
sub(/^ +/,"") 
gsub(/\"|<|>/,"",$0); 
sub(/\/.*/,""); 
if($0){ 
  print 
} 
} 
' qhr2400.xml

Output

OPERATION type=1
CLLI5018
COLLECTION_DATE06
SS7RT99
AQPRT_184
L7RMSUOCT_0180
L7RMSUOCT_0280

Any help is appreciated.

Thanks!

Fixed wrong typo tag `TABLENANE/TABLENAME` (my solution works even with this broken XML) — Gilles Quénot, Jun 12 '20 at 16:45

score 0 · Answer 1 · edited Jun 20 '20 at 09:12

For this, you need an XML parser and an xpath query in your shell, see:

$ xidel -se '//CLLI/text()' file.xml

Once your XML is fixed (opening/closing tag mismatch: TABLENANE/TABLENAME), xmllint works too:

xmllint --xpath '//CLLI/text()' file

This command comes with libxml2 and is far from exotic: it's installed by default on many Linux distros.

Output:

518

So now, you can retrieve all wanted values in shell variables, one example:

$ collectiondate=$(xidel -se '//COLLECTION_DATE/text()' file)
$ echo "$collectiondate"

But, please, don't use awk or regex to parse XML.

There are other tools; check: How to execute XPath one-liners from shell?

Check too: Using regular expressions with HTML tags (same thing for XML)
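To make the warning concrete, here is a small sketch. The attribute `unit="x"` is a hypothetical addition of mine, not part of the OP's file, and the awk one-liner is just an illustrative field splitter on `<` and `>`. The same data written as equally valid XML is enough to break the lookup by tag name:

```shell
# Illustrative field-splitting lookup: split on "<" or ">" and index by tag name.
lookup() {
    printf '%s\n' "$1" | awk -F'<|>' '{ vars[$2] = $3 } END { print vars["CLLI"] }'
}

# The line as it appears in the question.
plain=$(lookup '<CLLI>518</CLLI>')

# Hypothetical variant: same element, but carrying an attribute.
# The key stored is now the whole string between "<" and ">", so "CLLI" finds nothing.
with_attr=$(lookup '<CLLI unit="x">518</CLLI>')

printf 'plain=[%s] with_attr=[%s]\n' "$plain" "$with_attr"
```

An xpath query such as `//CLLI/text()` does not care whether the element has attributes.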

Going further

declare -A arr
for i in CLLI COLLECTION_DATE SS7RT; do
    arr[$i]=$(xmllint --xpath "//$i/text()" file.xml)
done

Now you have an associative array with CLLI COLLECTION_DATE SS7RT keys:

Keys:

$ printf '%s\n' "${!arr[@]}"
CLLI
SS7RT
COLLECTION_DATE

Values:

$ printf '%s\n' "${arr[@]}"
518
99
06/04/20 00:45:00

For COLLECTION_DATE:

$ echo "${arr[COLLECTION_DATE]}"
06/04/20 00:45:00

It's possible to fill an indexed array in one line too:

readarray -t a < <(xidel -se '//*[self::CLLI or self::COLLECTION_DATE or self::SS7RT]/text()' file.xml)

There's no need to call `xidel` multiple times. With one call it can set multiple variables at once. Maybe you can add that. — Reino, Jun 12 '20 at 19:44

score 0 · Answer 2 · answered Jun 12 '20 at 15:53

Don't parse XML/HTML with regex, use a proper XML/HTML parser and a powerful xpath query.

theory :

According to compiler theory, XML/HTML can't be parsed with regular expressions, which are based on finite-state machines. Because XML/HTML is hierarchical, you need a pushdown automaton, i.e. a parser for a context-free grammar, such as an LALR parser generated with a tool like YACC.

Check this thread too, why-its-not-possible-to-use-regex-to-parse-html-xml
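A tiny illustration of the point, as a sketch with a made-up `<R>` element (not taken from the OP's file): a greedy pattern breaks on siblings, a "no angle bracket inside" pattern breaks on nesting, and no single regular expression can pair every opening tag with its own closing tag at arbitrary depth.

```shell
siblings='<R>a</R><R>b</R>'
nested='<R><R>a</R>b</R>'

# Greedy pattern: runs from the first <R> to the last </R>, gluing two siblings together.
greedy=$(printf '%s\n' "$siblings" | sed -E 's#<R>(.*)</R>#[\1]#')

# Pattern that forbids "<" inside: only the innermost element matches,
# the outer <R> is never paired with its own </R>.
inner=$(printf '%s\n' "$nested" | grep -oE '<R>[^<]*</R>')

printf '%s\n%s\n' "$greedy" "$inner"
```

This prints `[a</R><R>b]` and then `<R>a</R>`; neither is the element a reader of the XML would expect.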

realLife©®™ everyday tool in a shell :

You can use one of the following :

xmllint often installed by default with libxml2, xpath1

xmlstarlet can edit, select, transform... Not installed by default, xpath1

xpath installed via perl's module XML::XPath, xpath1

xidel xpath3

saxon-lint my own project, wrapper over @Michael Kay's Saxon-HE Java library, xpath3

Or you can use a high-level language with a proper library. I'm thinking of:

python's lxml (from lxml import etree)

perl's XML::LibXML, XML::XPath, XML::Twig::XPath, HTML::TreeBuilder::XPath

ruby nokogiri, check this example

php DOMXpath, check this example

Check: Using regular expressions with HTML tags

score 0 · Answer 3 · answered Jun 13 '20 at 12:16

I want its values in variables like $CLLI = 518, $COLLECTION_DATE = 06/04/20 00:45:00, $SS7RT = 99, so that I can use these values later to write an insert query.

I'm going to interpret this as: you want every child node of the "ROW" node, together with its value, exported as a variable.
As "Gilles Quenot" already mentioned, please don't parse xml with regex. I'd suggest you give xidel a try.

You could do it manually and call xidel for each and every node...

CLLI=$(xidel -s qhr2400.xml -e '//CLLI')
COLLECTION_DATE=$(xidel -s qhr2400.xml -e '//COLLECTION_DATE')
[...]

...but xidel itself can also export variables, multiple at once even:

#multiple queries, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI' -e 'COLLECTION_DATE:=//COLLECTION_DATE' -e '[...]' --output-format=bash
#or one query, multiple declarations:
xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
[...]

The output is just a set of strings. To actually set/export these variables you have to use Bash's eval built-in command:

eval "$(xidel -s qhr2400.xml -e 'CLLI:=//CLLI,COLLECTION_DATE:=//COLLECTION_DATE,[...]' --output-format=bash)"
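Since `eval` executes whatever text it is given, it may be worth checking that text first when the XML comes from somewhere you don't control. A minimal sketch: the `assignments` variable below is a stand-in for the captured xidel output (copied from the listing above), and the accepted line format `NAME='value'` or `NAME=` is my own assumption, deliberately stricter than what the tool might legitimately print.

```shell
# Stand-in for: assignments=$(xidel -s qhr2400.xml -e '...' --output-format=bash)
assignments="CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'"

# Refuse to eval if any line is not a plain NAME='value' (or empty NAME=) assignment.
# Values containing a single quote are rejected too, which is conservative on purpose.
if printf '%s\n' "$assignments" | grep -qvE "^[A-Za-z_][A-Za-z0-9_]*=('[^']*')?\$"; then
    echo "unexpected output, not evaluating it" >&2
else
    eval "$assignments"
fi

echo "$CLLI / $COLLECTION_DATE"
```

With the sample text this prints `518 / 06/04/20 00:45:00`; a line such as `rm -rf x` would make the check fail and nothing would be evaluated.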

And finally, to do it fully automatic for every child-node in the "ROW"-node:

xidel -s qhr2400.xml -e '//ROW/*/name()'
CLLI
COLLECTION_DATE
SS7RT
AQPRT_1
L7RMSUOCT_01
L7RMSUOCT_02

xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"//ROW/{$x}")'
518
06/04/20 00:45:00
99
84
80
80

xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'
AQPRT_1='84'
L7RMSUOCT_01='80'
L7RMSUOCT_02='80'
result=

eval "$(xidel -s qhr2400.xml -e 'for $x in //ROW/*/name() return eval(x"{$x}:=//ROW/{$x}")[0]' --output-format=bash)"

score 0 · Answer 4 · answered Jun 13 '20 at 16:45

Another approach is to use XSLT (XSL Transformation)

Here is a fixed and indented version of the OP's XML file:

$ cat demo.xml
<XML>
    <OPERATION type="1">
       <TABLENAME>TABLE</TABLENAME>
       <ROWSET>
          <ROW>
             <CLLI>518</CLLI>
             <COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>
             <SS7RT>99</SS7RT>
             <AQPRT_1>84</AQPRT_1>
             <L7RMSUOCT_01>80</L7RMSUOCT_01>
             <L7RMSUOCT_02>80</L7RMSUOCT_02>
          </ROW>
        </ROWSET>
    </OPERATION>
</XML>

This is the stylesheet I will use:

$ cat demo.xsl
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/1999/XSL/Transform" version="1.0">
<xsl:output method="text" encoding="utf-8" />
<xsl:strip-space elements="*"/>

   <xsl:template match="ROW">
       <xsl:text>CLLI="</xsl:text><xsl:value-of select="CLLI"/><xsl:text>" </xsl:text>
       <xsl:text>COLLECTION_DATE="</xsl:text><xsl:value-of select="COLLECTION_DATE"/><xsl:text>" </xsl:text>
       <xsl:text>SS7RT="</xsl:text><xsl:value-of select="SS7RT"/><xsl:text>" </xsl:text>
       <xsl:text>AQPRT_1="</xsl:text><xsl:value-of select="AQPRT_1"/><xsl:text>" </xsl:text>
       <xsl:text>L7RMSUOCT_01="</xsl:text><xsl:value-of select="L7RMSUOCT_01"/><xsl:text>" </xsl:text>
       <xsl:text>L7RMSUOCT_02="</xsl:text><xsl:value-of select="L7RMSUOCT_02"/><xsl:text>" </xsl:text>
   </xsl:template>

   <xsl:template match="text()"/>

</xsl:stylesheet>

Here is a simple shell script which uses xsltproc to transform demo.xml into text suitable as input to eval, in order to create shell variables for the required element values.

$ cat demo.sh
#!/bin/bash

eval "$(xsltproc demo.xsl demo.xml)"

echo "CLLI: $CLLI"
echo "COLLECTION_DATE: $COLLECTION_DATE"
echo "SS7RT: $SS7RT"
echo "AQPRT_1: $AQPRT_1"
echo "L7RMSUOCT_01: $L7RMSUOCT_01"
echo "L7RMSUOCT_02: $L7RMSUOCT_02"

Run the script:

$ ./demo.sh
CLLI: 518
COLLECTION_DATE: 06/04/20 00:45:00
SS7RT: 99
AQPRT_1: 84
L7RMSUOCT_01: 80
L7RMSUOCT_02: 80
$
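Since the OP's end goal is an insert query, here is one possible way to go from such variables to a statement. This is only a sketch: the table name `my_table` and the helper `sql_quote` are hypothetical, the three assignments stand in for what the eval above would set, and doubling single quotes is standard SQL string escaping but your database client may offer something safer (bound parameters, if available, are preferable).

```shell
# Stand-ins for the variables created by the eval step.
CLLI='518'
COLLECTION_DATE='06/04/20 00:45:00'
SS7RT='99'

# Hypothetical helper: wrap a value in single quotes, doubling embedded single quotes.
sql_quote() {
    printf "'%s'" "$(printf '%s' "$1" | sed "s/'/''/g")"
}

# my_table is an assumed table name; the column names mirror the XML tag names.
sql="INSERT INTO my_table (CLLI, COLLECTION_DATE, SS7RT) VALUES ($(sql_quote "$CLLI"), $(sql_quote "$COLLECTION_DATE"), $(sql_quote "$SS7RT"));"

printf '%s\n' "$sql"
```

This prints `INSERT INTO my_table (CLLI, COLLECTION_DATE, SS7RT) VALUES ('518', '06/04/20 00:45:00', '99');`, which can then be piped to whatever command-line client your database provides.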

Slawomir Dziuba · Answer 5 · 2020-06-12T18:08:51.363

read_xml.sh

gawk '
BEGIN {
  FS="<|>"
}
# $2 is the tag name, $3 the value; keep only values that contain a digit
$3 ~ /[0-9]/ { vars[$2] = $3 }

END {
 print vars["CLLI"]
 print vars["SS7RT"]
 print vars["COLLECTION_DATE"]
 # etc...
}

' qhr2400.xml

result:

518
99
06/04/20 00:45:00

Of course, instead of printing in END, you can use the values stored in the vars array for something else.

Rejecting AWK as an XML or HTML parser is unreasonable. AWK is great as a parser for any file, including damaged XML files. Using AWK requires more thought, but in return you don't need to install any exotic software. It is possible to write an XML file in such a way that AWK reads some lines incorrectly, but the same can be said of XML analysis tools.

EDIT:

Handling a broken XML file in which a field is split across several lines.

The file qhr2400.xml contains:

<CLLI>
518
</CLLI>

instead of

<CLLI>518</CLLI>

call:

tr -d '\n' < qhr2400.xml | sed 's/> */>/g; s/ *</\n</g' | awk -f readxml.awk

readxml.awk is now:

BEGIN {
FS="<|>"
}
# $2 is the tag name, $3 the value; keep only values that contain a digit
$3 ~ /[0-9]/ { vars[$2] = $3 }

END {
 print vars["CLLI"]
 print vars["SS7RT"]
 print vars["COLLECTION_DATE"]
 # etc...
}

the result is correct
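As a possible variation (a sketch of mine, not the script above), awk alone can cope with values split across lines by using `<` as the record separator, which avoids the tr/sed preprocessing. The sample input is a shortened, hypothetical excerpt of qhr2400.xml; the approach still assumes no attributes, comments or CDATA.

```shell
out=$(printf '<ROW>\n<CLLI>\n518\n</CLLI>\n<COLLECTION_DATE>06/04/20 00:45:00</COLLECTION_DATE>\n</ROW>\n' |
awk 'BEGIN { RS = "<"; FS = ">" }   # one record per tag: "NAME>text up to the next <"
NF >= 2 {
    value = $2
    gsub(/^[ \t\n]+|[ \t\n]+$/, "", value)   # trim surrounding whitespace only
    # skip closing tags and container elements that have no text of their own
    if (value != "" && $1 !~ /^\//) vars[$1] = value
}
END { print vars["CLLI"]; print vars["COLLECTION_DATE"] }')

printf '%s\n' "$out"
```

This prints 518 and 06/04/20 00:45:00, with the space inside the date left intact.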

EDIT2

For some time, there has been a worrying fashion for adding complexity instead of simplifying the environment. Using a ready-made additional tool is usually a quick solution and may tempt you with its ease of use. Unfortunately, it is not always possible to install a huge Perl, Python or Ruby environment (e.g. on an embedded system with 32MB of flash), it is not always possible to compile even a smaller tool for your processor architecture, and company policy can rightly prohibit adding anything to the standard set. Using what is already there also makes sense for one-off processing of a file. AWK, sed and tr are usually available, and then they are the only rescue. Also, parsing an XML file does not always mean extracting key-value pairs; the goal can be something completely different, e.g. "ROW> <CLLI> 518 </CLLI> <COLLECTION", which makes ready-made analysis tools based on xpath useless. AWK is a programming language written specifically for parsing text files, and combined with the standard unix tools it is practically unlimited.

However, if you have little experience, it is better to rely on ready-made solutions where possible.

@Gilles-Quenot I know this text, I agree on regex, but on AWK he maintains his opinion. You show the general use of search in it. AWK solutions are always specific and are not limited to key-value pairs. With AWK you can parse any data file, and XML is just a special case of such a file. — Slawomir Dziuba, Jun 12 '20 at 16:23
AWK is not designed to parse HTML or XML. I don't agree with producing erroneous code that will fail one day or another. The sysadmin will then have to work out why the script failed, all because of the improper use of tools. The proper tool for the proper job: that's not a concept reserved for computing — Gilles Quénot, Jun 12 '20 at 16:24
I understand your approach but who told you that xml / html must immediately mean searching for key-values? I usually have reverse cases of analysis of why an exotic tool suddenly stopped working. — Slawomir Dziuba, Jun 12 '20 at 16:30
Just one test, one simple case where it fails: https://pastebin.com/raw/zipzZfXi Just because the OP's XML looks simple doesn't mean every use will be; you can't assume everything is fixed — Gilles Quénot, Jun 12 '20 at 16:39
Read what I wrote in the answer at the bottom. Scuffles don't make sense. — Slawomir Dziuba, Jun 12 '20 at 16:45
Scuffles ? Just plain valid XML syntax. Your awk failed, not xpath — Gilles Quénot, Jun 12 '20 at 16:52
You assumed the parsing of the key-values file. And if I wanted to get the string "ROW> 518 — Slawomir Dziuba, Jun 12 '20 at 16:57