1

I have a list of citations in a csv file that I would like to use to fill out the XML based query form at CrossRef

CrossRef provides an XML template (below, with unused fields removed), and I would like to parse the columns of the csv file to fill out repeated fields within the query tag :

 <?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
  <query enable-multiple-hits="true"
            list-components="false"
            expanded-results="false" key="key">
    <article_title match="fuzzy"></article_title>
    <author search-all-authors="false"></author>
    <volume></volume>
    <year></year>
    <first_page></first_page>
    <journal_title></journal_title>
  </query>
</body>
</query_batch>

How can this be done in a shell script?

sample input:

author,year,article_title,journal_title,volume,first_page
Adler,2006,"Biomass yield and biofuel quality of switchgrass harvested in fall or spring","Agronomy Journal",98,1518
Alexopolou,2008,"Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region","Biomass and Bioenergy",32,926
Balasko,1984,"Yield and Quality of Switchgrass Grown without Soil Amendments.","Agronomy Journal",76,204

desired output:

<?xml version = "1.0" encoding="UTF-8"?>
<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body>
 <query>
  <author>Adler</author >
  <year>2006</year >
  <article_title>Biomass yield and biofuel quality of switchgrass harvested in fall or spring</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>98</volume >
  <first_page>1518</first_page >
 </query>
 <query>
  <author>Alexopolou</author >
  <year>2008</year >
  <article_title>Biomass yields for upland and lowland switchgrass varieties grown in the Mediterranean region</article_title >
  <journal_title>Biomass and Bioenergy</journal_title >
  <volume>32</volume >
  <first_page>926</first_page >
 </query>
 <query>
  <author>Balasko</author >
  <year>1984</year >
  <article_title>Yield and Quality of Switchgrass Grown without Soil Amendments.</article_title >
  <journal_title>Agronomy Journal</journal_title >
  <volume>76</volume >
  <first_page>204</first_page >
 </query>
</body>

Other questions provide some help on doing this in C# and Java

Community
  • 1
  • 1
David LeBauer
  • 31,011
  • 31
  • 115
  • 189

2 Answers2

3
#!/usr/bin/awk -f
# XML Attributes Must be Quoted. Attribute values must always be quoted. Either single or double quotes can be used.

BEGIN{
    FS=","
    print "<?xml version = '1.0' encoding='UTF-8'?>"
    print "<query_batch xmlns:xsi='http://www.w3.org/2001/XMLSchema-instance' version='2.0' xmlns='http://www.crossref.org/qschema/2.0'"
    print "  xsi:schemaLocation='http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd'>"
    print "<head>"
    print "   <email_address>test@crossref.org</email_address>"
    print "   <doi_batch_id>test</doi_batch_id>"
    print "</head>"
    print "<body>"
}

NR>1{
    print "  <query enable-multiple-hits='true'"
    print "            list-components='false'"
    print "            expanded-results='false' key='key'>"
    print "    <article_title match='fuzzy'>" $3 "</article_title>"
    print "    <author search-all-authors='false'>" $1 "</author>"
    print "    <volume>" $5 "</volume>"
    print "    <year>" $2 "</year>"
    print "    <first_page>" $6 "</first_page>"
    print "    <journal_title>" $4 "</journal_title>"
    print "  </query>"
}

END{
    print "</body>"
    print "</query_batch>"
}

$ awk -f script.awk input.csv
David LeBauer
  • 31,011
  • 31
  • 115
  • 189
kev
  • 155,172
  • 47
  • 273
  • 272
  • Note that this doesn't do the escaping necessary to ensure valid XML files in output -- `&` needs to be converted to `&`, and the like, to be certain that the generated document is well-formed. – Charles Duffy Mar 27 '12 at 02:35
  • @kev I 1) saved this into foo.sh, 2) ran `chmod 777 foo.sh` and 3) ran `./foo.sh citations.csv`. Then I get a lot of errors, starting with `line 4: BEGIN{: command not found` and `Error: no such file ""`. Any ideas? – David LeBauer Mar 27 '12 at 22:57
  • @David the obvious thing for that bug would be the shebang line; it needs to be `#!/usr/bin/awk`; I suspect the `!` was left out as a typo. – Charles Duffy Mar 28 '12 at 03:02
3

Unlike the approaches using text substitution (ie. awk), this one is guaranteed to always emit a well-formed XML document, with content properly escaped. It's ugly, but it's far more correct. Note that this requires a 3rd-party tool; nothing included with the shell proper is capable of safely editing XML.

First, put a document with no body in template.xml:

<query_batch xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance" version="2.0" xmlns="http://www.crossref.org/qschema/2.0"
  xsi:schemaLocation="http://www.crossref.org/qschema/2.0 http://www.crossref.org/qschema/crossref_query_input2.0.xsd">
<head>
   <email_address>test@crossref.org</email_address>
   <doi_batch_id>test</doi_batch_id>
</head>
<body/>
</query_batch>

Second, build an XMLStarlet command line describing the edits desired, and invoke it:

#!/bin/bash
xmlstarlet_command=( )
read_header=0
while IFS=, read author year article_title journal_title volume first_page; do
  if (( read_header == 0 )); then read_header=1; continue; fi
  xmlstarlet_command+=( -s /qs:query_batch/qs:body -t elem -n query -v '' )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n list-components -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n expanded-results -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n key -v key )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t attr -n enable-multiple-hits -v true )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n article_title -v "$article_title" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/article-title' -t attr -n match -v fuzzy )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n author -v "$author" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]/author' -t attr -n search-all-authors -v false )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n volume -v "$volume" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n year -v "$year" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n first_page -v "$first_page" )
  xmlstarlet_command+=( -i '/qs:query_batch/qs:body/*[last()]' -t elem -n journal_title -v "$journal_title" )
done <in.csv
xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" <template.xml

Note that, like other solutions given here, this doesn't strip the double quotes from the beginning and end of the CSV elements; like other aspects of advanced CSV parsing, this is better left to something like the Python CSV module, which actually knows how to recognize escaped quotes, text fields containing newlines, and all the other little oddities that can happen inside valid CSV files.

As an aside -- be aware that older versions of XMLStarlet have a limit on the number of operations per invocation fixed in the latest release. I have a workaround for this (which also allows edit lists longer than the ~32K or so maximum command line length), but it probably deserves to be its own question.

Jens
  • 69,818
  • 15
  • 125
  • 179
Charles Duffy
  • 280,126
  • 43
  • 390
  • 441
  • That looks nice, but where do you pass the name of the csv file? – David LeBauer Mar 27 '12 at 02:21
  • @David well, here, I'm hardcoding it as `in.csv`; replace that string with whatever you like. In an older version, I required the CSV file to be passed in on standard input. – Charles Duffy Mar 27 '12 at 02:30
  • I tried this, but get a segmentation fault error: `./foo.sh: line 21: 2869 Segmentation fault xmlstarlet ed -N qs=http://www.crossref.org/qschema/2.0 "${xmlstarlet_command[@]}" < template.xml` any ideas? – David LeBauer Mar 27 '12 at 22:54
  • @David see the last paragraph of my answer with a link to the XMLStarlet bugtracker -- if the number of operations is over 128, anything but the very latest version of XMLStarlet will crash, and with the program as-given, it's 14 operations per line in the input CSV. – Charles Duffy Mar 27 '12 at 22:55
  • @David wrote it up as a new question; see http://stackoverflow.com/questions/9898939/handling-long-edit-lists-in-xmlstarlet/9899041 -- note that this wrapper has different usage from the original xmlstarlet, but the differences should be explained adequately there. – Charles Duffy Mar 27 '12 at 23:14
  • let us [continue this discussion in chat](http://chat.stackoverflow.com/rooms/9385/discussion-between-charles-duffy-and-david) – Charles Duffy Mar 27 '12 at 23:24
  • I think using `xmlstarlet sel` instead of `ed` would make more sense in this case. Also, the 128 limit for `ed` has been lifted but a version with that fix hasn't been released yet. – npostavs Mar 29 '12 at 02:52
  • 1
    @npostavs yes, I acknowledged that this was fixed in my answer ("older versions of XMLStarlet..."). That said, I'm very unclear on how you'd use `xmlstarlet sel` to construct a new document using information from an external data source (in this case, a CSV). Perhaps you could add your own answer demonstrating? – Charles Duffy Mar 29 '12 at 04:59
  • I meant to clarify that the fix is currently unreleased. – npostavs Mar 30 '12 at 15:12