awk field separator with in the xml

Question

I have a xml file with the following data.

<record record_no = "2" error_code="100">&quot;18383531&quot;;&quot;22677833&quot;;&quot;21459732&quot;;&quot;41001&quot;;&quot;394034&quot;;&quot;0208&quot;;&quot;Prime Lending - ;Corporate  - 2201&quot;;&quot;&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;1&quot;;&quot;rrvc&quot;;&quot;Tiffany Poe&quot;;&quot;HEIDI&quot;;&quot;BUNDY&quot;;&quot;000002274&quot;;&quot;2.0&quot;;&quot;18.0&quot;;&quot;2&quot;;&quot;362661&quot;;&quot;Rejected by IRS&quot;;&quot;A1AAA&quot;;&quot;20160720&quot;;&quot;1021&quot;;&quot;HEDI &amp; Bundy&quot;;&quot;4985045838&quot;;&quot;PPASSESS&quot;;&quot;Web&quot;;&quot;3683000826&quot;;&quot;823&quot;;&quot;IC W2&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;Rapid_20160801_Monthly.txt&quot;;&quot;20160720102100&quot;;&quot;&quot;;&quot;20160803095309&quot;;&quot;286023&quot;;&quot;RGT&quot;;&quot;1&quot;;&quot;14702324400223&quot;;&quot;14702324400223&quot;;&quot;0&quot;;&quot;OMCProcessed&quot;

I'm using following code:

cat RR_00404.fin.bc_lerr.xml.bc| awk 'BEGIN { FS=OFS=";" }/<record/ { gsub(/&quot;/,"\"");  gsub(/.*=" ">.*/,"",$1);print $1,$40,$43,$46 ,"'base_err_xml'", "0",$7; }'

The idea is to do the following:

Replace &quote; with "
Extract the error_code
Print " and ; seperated values.
Use sqlldr to load ( not to worry about this).

Problem to solve:

There is ; within the text. e.g Prime Lending -;Corporate - 2201
There's &

Output:

100;"20160803095309";"1";"1";"base_err_xml";"0";"Prime Lending
100;"286023";"14702324400223";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate  - 2201"
100;"286024-1";"";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate  - 2201"

Why in the world are you trying to use `awk` for this job in the first place? — Charles Duffy, Sep 13 '16 at 03:20
Standard answer: [don't](http://stackoverflow.com/a/1732454/4959010). — Sato Katsura, Sep 13 '16 at 03:34
Agree with most of the comments, however, the text inside xml is what is parsed here, not the xml tag. The XML file will just tag, it is not well formed xml. — user2570205, Sep 13 '16 at 16:05

Charles Duffy · Answer 1 · 2016-09-13T03:35:34.160

awk is the wrong tool for this job, without some preprocessing. Here, we use XMLStarlet for the first pass (decoding all XML entities and splitting attributes off into separate fields), and GNU awk for the second (reading those fields and performing whatever transforms or logic you actually need):

#!/bin/sh

# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content for remainder of output line.

xmlstarlet sel -t -m '//record' \
  -v ./@record_no -o ';' \
  -v ./@error_code -o ';' \
  -v . -n

...and, cribbed from the GNU awk documentation...

#!/bin/env gawk -f
# must be GNU awk for the FPAT feature

BEGIN {
    FPAT = "([^;]*)|(\"[^\"]*\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}

Here, what we're doing with gawk is just showing how the fields get split, but obviously, you can modify the script for whatever needs you have.

A subset of output from the above for your given input file (when extended to actually be valid XML) is quoted below:

$1 = <2>
$2 = <100>
$9 = <"Prime Lending - ;Corporate  - 2201">

Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.

Obviously, you can encapsulate both these components in shell functions to avoid the need for separate files.

@EdMorton, ty, adopted your suggestion re: `FPAT`. – Charles Duffy Sep 13 '16 at 03:36 — Charles Duffy, Sep 13 '16 at 03:36

awk field separator with in the xml

1 Answers1