How do I extract and validate XML files using awk and xmllint in a pipeline?
Awk program that only extracts files:
extractxml
#!/usr/bin/awk -f
# Requires GNU awk: the three-argument match() is a gawk extension.
/<\?xml version/ {
    getline doctype; getline datadoc
    if (match(datadoc, /file="([^-]+)-[^"]+\.XML"/, a)) {
        fn = a[1] ".xml"; print $0 ORS doctype ORS datadoc > fn
        print fn; next
    }
}
{ print > fn }
The concatenated XML input file:
refcase.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>
Validation command:
xmllint --debug --dtdvalid refcase.dtd aa1234aa.xml
DTD file used by xmllint to validate the XML files:
refcase.dtd
<?xml encoding="UTF-8"?>
<!ELEMENT data-document (document-metatdata)>
<!ATTLIST data-document
xmlns CDATA #FIXED ''
date-published CDATA #REQUIRED
dtd-version CDATA #REQUIRED
file NMTOKEN #REQUIRED>
<!ELEMENT document-metatdata (document-reference)>
<!ATTLIST document-metatdata
xmlns CDATA #FIXED ''
country NMTOKEN #REQUIRED
lang NMTOKEN #REQUIRED>
<!ELEMENT document-reference EMPTY>
<!ATTLIST document-reference
xmlns CDATA #FIXED ''>
When I add this code to the awk program:
{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1]".xml.rpt")
- The awk extract still works fine and the .xml files are created as before.
- The awk output is now passed to the xmllint command for XML validation, but there appears to be a problem with the input that xmllint receives.
Awk program that extracts the files and sends the output to the xmllint command:
#!/usr/bin/awk -f
/<\?xml version/ {
    getline doctype; getline datadoc
    if (match(datadoc, /file="([^-]+)-[^"]+\.XML"/, a)) {
        fn = a[1] ".xml"; print $0 ORS doctype ORS datadoc > fn
        print fn; next
    }
}
{ print > fn } system("xmllint --debug --dtdvalid refcase.dtd " fn " > " a[1] ".xml.rpt")
Problem output from the xmllint command when invoked in awk:
aa1234aa.xml
aa1234aa.xml:5: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:5: parser error : Premature end of data in tag data-document line 3
<document-metatdata lang="EN" country="INTL">
aa1234aa.xml:6: parser error : Premature end of data in tag document-metatdata line 4
aa1234aa.xml:6: parser error : Premature end of data in tag data-document line 3
<document-reference/>
aa1234aa.xml:7: parser error : Premature end of data in tag data-document line 3
The parser errors do not occur when the command is executed in the shell; they only occur when it is invoked from the awk program, which suggests to me that the extracted XML files are okay.
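One possible cause, offered as an assumption rather than a confirmed diagnosis: a bare system(...) written after an action block is parsed by awk as a pattern, so it is evaluated for every input line while the output file is still open and only partly written. The sketch below, assuming POSIX sh and awk, closes each file before validating it: once when the next XML declaration appears and once at END. The names split_and_validate and validate are hypothetical, as is the optional second argument that substitutes another validator command; 2>&1 is added so that parser errors are captured in the .rpt file.

```shell
#!/bin/sh
# Sketch only: split a concatenated file and validate each document after it
# has been fully written. Usage: split_and_validate INPUT [VALIDATOR]
# VALIDATOR (hypothetical parameter) defaults to the xmllint command above.
split_and_validate() {
    awk -v validator="${2:-xmllint --debug --dtdvalid refcase.dtd}" '
    # Hypothetical helper: close the finished file, then run the validator.
    function validate(f) {
        if (f == "") return
        close(f)    # flush and close so the validator sees the whole file
        system(validator " " f " > " f ".rpt 2>&1")
    }
    /<\?xml version/ {
        validate(fn)    # the previous document is complete at this point
        fn = ""         # documents without a usable file attribute are skipped
        decl = $0; getline doctype; getline datadoc
        if (match(datadoc, /file="[^-"]+-[^"]+\.XML"/)) {
            # skip the 6 leading characters (file= plus the opening quote)
            name = substr(datadoc, RSTART + 6, RLENGTH - 6)
            sub(/-.*/, "", name)    # keep the text before the first hyphen
            fn = name ".xml"
            print decl ORS doctype ORS datadoc > fn
            print fn
        }
        next
    }
    fn != "" { print > fn }
    END { validate(fn) }    # the last document has no following declaration
    ' "$1"
}
```

With the sample above this would be called as: split_and_validate refcase.xml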
This is an extraction process for thousands of concatenated text files, each containing thousands of XML documents. I need to trace and audit every step and validate the outputs.
Expected output of extracted xml files:
aa1234aa.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa1234aa-20170101.XML">
<document-metatdata lang="EN" country="INTL">
<document-reference/>
</document-metatdata>
</data-document>
aa2345bb.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa2345bb-20170202.XML">
<document-metatdata lang="EN" country="LOCAL">
<document-reference/>
</document-metatdata>
</data-document>
aa3456cc.xml
<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE data-document SYSTEM "refcase.dtd" [ ]>
<data-document lang="EN" dtd-version="v1 2017-01-01" file="aa3456cc-20170303.XML">
<document-metatdata lang="EN" country="NA">
<document-reference/>
</document-metatdata>
</data-document>
Questions:
How can I have awk write each document to a file and also pass that file to a command for further processing?
I am not sure whether awk is the best tool for this extraction, although it has worked well so far on the test data. I need to log the process and validate the output.
I would appreciate any other approaches that are reliable and scalable.
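For the logging and audit requirement, one alternative I can imagine is to keep extraction and validation as separate passes: let the extract-only program finish, save the file names it prints, then validate each listed file and record the validator's exit status. This is a minimal sketch assuming POSIX sh; validate_listed, extracted.list and audit.log are hypothetical names, and it relies on xmllint returning a non-zero exit status when parsing or validation fails.

```shell
#!/bin/sh
# Sketch only: validate already-extracted files and keep an audit log.
# Usage: validate_listed LIST LOG [VALIDATOR]
#   LIST       one extracted file name per line (for example the awk output)
#   LOG        tab-separated audit log: file name, validator exit status
#   VALIDATOR  command prefix; defaults to the xmllint command above
validate_listed() {
    list=$1
    log=$2
    validator=${3:-xmllint --debug --dtdvalid refcase.dtd}
    : > "$log"    # start with an empty log
    while IFS= read -r f; do
        status=0
        # $validator is left unquoted on purpose so its options split into words
        $validator "$f" > "$f.rpt" 2>&1 || status=$?
        printf '%s\t%s\n' "$f" "$status" >> "$log"
    done < "$list"
}
```

A run might look like: awk -f extractxml refcase.xml > extracted.list && validate_listed extracted.list audit.log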