0

I have a xml file with the following data.

<record record_no = "2" error_code="100">&quot;18383531&quot;;&quot;22677833&quot;;&quot;21459732&quot;;&quot;41001&quot;;&quot;394034&quot;;&quot;0208&quot;;&quot;Prime Lending - ;Corporate  - 2201&quot;;&quot;&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;Prime Lending - Lacey - 2508&quot;;&quot;1&quot;;&quot;rrvc&quot;;&quot;Tiffany Poe&quot;;&quot;HEIDI&quot;;&quot;BUNDY&quot;;&quot;000002274&quot;;&quot;2.0&quot;;&quot;18.0&quot;;&quot;2&quot;;&quot;362661&quot;;&quot;Rejected by IRS&quot;;&quot;A1AAA&quot;;&quot;20160720&quot;;&quot;1021&quot;;&quot;HEDI &amp; Bundy&quot;;&quot;4985045838&quot;;&quot;PPASSESS&quot;;&quot;Web&quot;;&quot;3683000826&quot;;&quot;823&quot;;&quot;IC W2&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;&quot;;&quot;Rapid_20160801_Monthly.txt&quot;;&quot;20160720102100&quot;;&quot;&quot;;&quot;20160803095309&quot;;&quot;286023&quot;;&quot;RGT&quot;;&quot;1&quot;;&quot;14702324400223&quot;;&quot;14702324400223&quot;;&quot;0&quot;;&quot;OMCProcessed&quot;

I'm using following code:

cat RR_00404.fin.bc_lerr.xml.bc| awk 'BEGIN { FS=OFS=";" }/<record/ { gsub(/&quot;/,"\"");  gsub(/.*=" ">.*/,"",$1);print $1,$40,$43,$46 ,"'base_err_xml'", "0",$7; }' 

The idea is to do the following:

  1. Replace &quote; with "
  2. Extract the error_code
  3. Print " and ; seperated values.
  4. Use sqlldr to load ( not to worry about this).

Problem to solve:

  1. There is ; within the text. e.g Prime Lending -;Corporate - 2201
  2. There's &amp;

Output:

100;"20160803095309";"1";"1";"base_err_xml";"0";"Prime Lending
100;"286023";"14702324400223";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate  - 2201"
100;"286024-1";"";"OMCProcessed";"base_err_xml";"0";"Prime Lending - Corporate  - 2201"
user2570205
  • 137
  • 1
  • 11

1 Answers1

1

awk is the wrong tool for this job, without some preprocessing. Here, we use XMLStarlet for the first pass (decoding all XML entities and splitting attributes off into separate fields), and GNU awk for the second (reading those fields and performing whatever transforms or logic you actually need):

#!/bin/sh

# reads XML on stdin; puts record_no in first field, error code in second,
# ...record content for remainder of output line.

xmlstarlet sel -t -m '//record' \
  -v ./@record_no -o ';' \
  -v ./@error_code -o ';' \
  -v . -n

...and, cribbed from the GNU awk documentation...

#!/bin/env gawk -f
# must be GNU awk for the FPAT feature

BEGIN {
    FPAT = "([^;]*)|(\"[^\"]*\")"
}

{
    print "NF = ", NF
    for (i = 1; i <= NF; i++) {
        printf("$%d = <%s>\n", i, $i)
    }
}

Here, what we're doing with gawk is just showing how the fields get split, but obviously, you can modify the script for whatever needs you have.


A subset of output from the above for your given input file (when extended to actually be valid XML) is quoted below:

$1 = <2>
$2 = <100>
$9 = <"Prime Lending - ;Corporate  - 2201">

Note, then, that $1 is the record_no, $2 is the error_code, and $9 correctly contains the semicolon as literal content.


Obviously, you can encapsulate both these components in shell functions to avoid the need for separate files.

Charles Duffy
  • 280,126
  • 43
  • 390
  • 441