awk pipeline to extract and validate xml files

Question

How do I extract and validate xml files using awk and xmllint in a pipeline?

Awk program that only extracts files:

extractxml

#!/usr/bin/awk -f
/<?xml version/{ getline doctype; getline datadoc;
     if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
         fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
     }}{ print > fn }

The input concatenated xml file:

refcase.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>

Validation command:

xmllint --debug --dtdvalid refcase.dtd aa1234aa.xml

XML dtd file used by xmllint for validation of the xml file:

refcase.dtd

<?xml encoding="UTF-8"?>

<!ELEMENT data-document (document-metatdata)>
<!ATTLIST data-document
  xmlns CDATA #FIXED ''
  date-published CDATA #REQUIRED
  dtd-version CDATA #REQUIRED
  file NMTOKEN #REQUIRED>

<!ELEMENT document-metatdata (document-reference)>
<!ATTLIST document-metatdata
  xmlns CDATA #FIXED ''
  country NMTOKEN #REQUIRED
  lang NMTOKEN #REQUIRED>

<!ELEMENT document-reference EMPTY>
<!ATTLIST document-reference
xmlns CDATA #FIXED ''>

When I add this code to the awk program:

{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")

The awk extract still works fine and the .xml files are created as before.
The awk output is now passed to the xmllint command for xml validation, but there appears to be a problem with the input to xmllint.

Awk program that extracts the files and sends the output to the xmllint command:

#!/usr/bin/awk -f
/<?xml version/{ getline doctype; getline datadoc;
     if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
         fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
     }}{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")

Problem output from the xmllint command when invoked in awk:

aa1234aa.xml
aa1234aa.xml:5: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:5: parser error : Premature end of data in tag data-document line 3
<document-metatdata lang="EN" country="INTL">
aa1234aa.xml:6: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:6: parser error : Premature end of data in tag data-document line 3
<document-reference/>
aa1234aa.xml:7: parser error : Premature end of data in tag data-document line 3

The parser errors do not occur when the command is executed in the shell; they occur only when it is run from the awk program, which suggests to me that the extracted xml files are okay.

It is an extraction process for thousands of concatenated txt files that each contain thousands of xml files. I need to trace and audit all the steps and validate the outputs.

Expected output of extracted xml files:

aa1234aa.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>

aa2345bb.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>

aa3456cc.xml

<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>

Questions:

I would like awk to write the output to a file and redirect the output to a command for further processing.

I'm not sure whether awk is the best tool for extraction, although it has worked well so far across the test data. I need to log the process and validate the output.

I would appreciate any other approaches that are reliable and scalable.

Gabe, I'm pretty sure awk is not the tool you should be using. XML is inherently recursive in structure, and isn't a great fit for awk's line-by-line approach. [`xmlstarlet`](http://xmlstar.sourceforge.net/) is one common tool, and node.js or PHP might also be useful - e.g., [this answer](https://stackoverflow.com/a/3577662/2877364) for PHP. — cxw, Jun 06 '17 at 15:04

Ed Morton · Accepted Answer · 2017-06-07T11:40:56.560

Your posted command is:

/<?xml version/{ getline doctype; getline datadoc;
     if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
         fn=a[1]".xml"; print $0 ORS doctype ORS datadoc > fn; print a[1]".xml" ; next;
     }}{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")

Step 1 is to fix it to use sensible formatting so we can see the control flow:

/<?xml version/{
     getline doctype
     getline datadoc
     if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
         fn=a[1]".xml"
         print $0 ORS doctype ORS datadoc > fn
         print a[1]".xml"
         next
     }
}
{ print > fn }
system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")

OK, so now at a glance we can see that the system() call sits in the condition (pattern) position instead of inside an action block, the script isn't closing output files as it goes, it isn't quoting the xmllint file names, and it hard-codes a[1]".xml" in multiple places, so let's fix those:

/<?xml version/{
     getline doctype
     getline datadoc
     if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
         close(fn)
         fn=a[1]".xml"
         print $0 ORS doctype ORS datadoc > fn
         print fn
         next
     }
}
{
    print > fn
    system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
}
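
To see why the placement of system() matters, here is a small sketch that is not part of the original script. It assumes a POSIX shell and awk, and uses `true` and `false` as stand-ins for an external command. An expression written outside braces is treated as a pattern, so awk runs it for every record and prints the record whenever the result is non-zero:

```shell
#!/bin/sh
# A bare expression outside braces is a pattern: awk evaluates it for every
# record and prints the record when the result is non-zero.
# `true` and `false` are stand-ins for any external command.
out_zero=$(printf 'a\nb\n' | awk 'system("true")')      # exit status 0: pattern is false
out_nonzero=$(printf 'a\nb\n' | awk 'system("false")')  # exit status 1: pattern is true
printf 'status 0 prints: [%s]\n' "$out_zero"
printf 'status 1 prints: [%s]\n' "$out_nonzero"
```

With the command inside an action block, its return value no longer decides whether the record is printed.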

Now let's get rid of the fragile and unnecessary calls to getline:

/<?xml version/{
    xmlversion = $0
    cnt = 3
}
cnt==2 {
    doctype = $0
}
cnt==1 {
    datadoc = $0
    if (match(datadoc,/file="([^-]+)-[^"]+.XML"/,a)) {
        close(fn)
        fn=a[1]".xml"
        print xmlversion ORS doctype ORS datadoc > fn
        print fn
        cnt = 0    # header consumed; otherwise the next line would be skipped
        next
    }
}
cnt { cnt--; next }
{
    print > fn
    system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
}

Now we can see that you're calling "xmllint" for every line that's output instead of on every output file that's completed. Change your command to this:

/<?xml version/{
    xmlversion = $0
    cnt = 3
}
cnt==2 {
    doctype = $0
}
cnt==1 {
    if (match($0,/file="([^-]+)-[^"]+.XML"/,a)) {
        lint(fn)
        fn=a[1]".xml"
        print xmlversion ORS doctype ORS $0 > fn
        print fn
        cnt = 0    # header consumed; otherwise the next line would be skipped
        next
    }
}
cnt { cnt--; next}
{ print > fn }
END { lint(fn) }

function lint(fn) {
    if (fn != "") {
        close(fn)
        system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
        fn = ""
    }
}
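
The close-then-run pattern used by lint() can be tried in isolation. This is a minimal sketch under stated assumptions, not the asker's pipeline: the "== name" marker format and the check() helper are made up for the demo, and `wc -l` stands in for xmllint so that it runs without a DTD.

```shell
#!/bin/sh
# Sketch: split records into per-section files and run a checker once per
# completed file. `wc -l` is a stand-in for xmllint; the "== name" marker
# format is hypothetical and exists only for this demo.
tmp=$(mktemp -d) || exit 1

report=$(cd "$tmp" && printf '== one\na\nb\n== two\nc\n' | awk '
function check(f) {
    if (f != "") {
        close(f)                          # flush and release the file first
        system("wc -l < \047" f "\047")   # then run the command exactly once
    }
}
/^== / { check(fn); fn = $2 ".txt"; next }   # a marker ends the previous file
{ print > fn }                               # demo input always starts with a marker
END { check(fn) }                            # the last file has no following marker
')
echo "$report"
rm -r "$tmp"
```

Each file is complete on disk before the command sees it, which is what the per-line system() call could not guarantee. If the report should also include xmllint's error messages, which it writes to stderr, appending 2>&1 to the command string is one option.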

Finally, given what I now know about your expected output, this is how I'd really write your script (also fixed the unescaped regexp metacharacters ? in <?xml and . in .XML that I hadn't spotted previously):

/<\?xml version/ {
    lint(fn)
    fn = ""
}
match($0,/file="([^-]+)-[^"]+\.XML"/,a) {
    fn = a[1]".xml"
    $0 = prev2 ORS prev1 ORS $0
    print fn
}
{
    if ( fn != "" ) {
        print > fn
    }
    prev2 = prev1
    prev1 = $0
}
END { lint(fn) }

function lint(fn) {
    if (fn != "") {
        close(fn)
        system("xmllint --debug --dtdvalid refcase.dtd \047" fn "\047 > \047" fn ".rpt\047")
    }
}
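
The effect of the two unescaped metacharacters can be checked quickly. This is a sketch with made-up input lines, assuming any POSIX awk: neither line contains a literal "<?" or ".XML", yet the unescaped patterns match both.

```shell
#!/bin/sh
# Unescaped: "<?" means "an optional <" and "." means "any character".
loose=$(printf 'xml version\nfile="a-1zXML"\n' |
    awk '/<?xml version/ || /file="[^-]+-[^"]+.XML"/ { n++ } END { print n+0 }')
# Escaped: both characters are matched literally.
strict=$(printf 'xml version\nfile="a-1zXML"\n' |
    awk '/<\?xml version/ || /file="[^-]+-[^"]+\.XML"/ { n++ } END { print n+0 }')
echo "loose=$loose strict=$strict"
```

This prints loose=2 strict=0: the unescaped patterns accept both malformed lines, while the escaped ones reject them.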

Thank you for all your help. The explanations, debugging and solution have been very useful and informative; I'm grateful. — Gabe, Jun 07 '17 at 11:52
