I'm trying to extract bibliographic data from the Library of Congress web service. To summarize, the resulting XML looks like this:

<zs:searchRetrieveResponse>
  <zs:version>1.1</zs:version>
  <zs:numberOfRecords>1</zs:numberOfRecords>
  <zs:records>
    <zs:record>
      <zs:recordSchema>info:srw/schema/1/mods-v3.2</zs:recordSchema>
      <zs:recordPacking>xml</zs:recordPacking>
      <zs:recordData>
        <mods version="3.2" xsi:schemaLocation="http://www.loc.gov/mods/v3 http://www.loc.gov/standards/mods/v3/mods-3-2.xsd">
          (Actual data I care about)
        </mods>
      </zs:recordData>
      <zs:recordPosition>1</zs:recordPosition>
    </zs:record>
  </zs:records>
</zs:searchRetrieveResponse>

I used XMLBeans to compile a Java client that reads the data inside the "mods" tag, since it has an associated schema. So, essentially, I need to extract the mods element and its contents and treat all that as a separate XML document. I could do this with regex but would prefer a real XML solution ("never parse XML with regex", I hear continuously). I wrote the following SSCCE code.

import java.io.BufferedReader;
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.net.HttpURLConnection;
import java.net.URL;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathExpression;
import javax.xml.xpath.XPathExpressionException;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.SAXException;
public class LibraryOfCongress {
  public static void main(String[] args) throws XPathExpressionException,
      ParserConfigurationException, SAXException, IOException {
    String URL = "http://z3950.loc.gov:7090/voyager?operation=searchRetrieve&version=1.1&recordSchema=mods&maximumRecords=1&query=bath.isbn=0120502577";
    HttpURLConnection conn = (HttpURLConnection) (new URL(URL))
        .openConnection();
    conn.setRequestMethod("GET");
    int responseCode = conn.getResponseCode();
    String document = null;
    if (responseCode == HttpURLConnection.HTTP_OK) {
      BufferedReader rd;
      InputStream in = conn.getInputStream();
      rd = new BufferedReader(new InputStreamReader(in));
      String tempLine = rd.readLine();
      StringBuilder response = new StringBuilder();
      while (tempLine != null) {
        response.append(tempLine).append("\n");
        tempLine = rd.readLine();
      }
      document = response.toString();
      rd.close();
    }
    if(document==null) return;
    ByteArrayInputStream stream = new ByteArrayInputStream(document.getBytes());
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    DocumentBuilder builder = factory.newDocumentBuilder();
    Document doc = builder.parse(stream);
    XPathFactory xPathfactory = XPathFactory.newInstance();
    XPath xpath = xPathfactory.newXPath();
    XPathExpression expr = xpath
        .compile("/zs:searchRetrieveResponse/zs:records/zs:recordData");
    Document ret = (Document) expr.evaluate(doc, XPathConstants.NODE);
    if(ret!=null) {
      String retval = ret.toString();
      System.out.println(retval);
    }
  }
}

It doesn't do anything because ret is null. The variations I tried:

1)

  .compile("/");
  ...
  String ret = (String) expr.evaluate(doc);

Returns the document sans any tags. This is the only output I've been able to finagle but of course I need the tags to pass to the client generated by xmlbeans.

2) Various other XPath query strings but I can't get useful output specifying anything beyond the root node.

Some additional concerns:

1) I've read that a node returned via XPathConstants.NODE still holds a reference back to the original document and will not produce an independent document like I require. I'm not sure what to do about that; I would have thought that independently parseable nodes would be one of the major reasons for XPath.

2) I have no idea how to handle the namespaces in the XPath expression. I just took a guess.

KevinRethwisch
    When your input XML uses namespaces (which it does) you should also declare the namespaces within Java for using it in XPath. See next answer on this question: http://stackoverflow.com/questions/3939636/how-to-use-xpath-on-xml-docs-having-default-namespace – Mark Veenstra Nov 10 '13 at 08:51

1 Answer

If you want to use XPath against XML with namespaces, make sure you use a namespace-aware DocumentBuilder by calling setNamespaceAware(true) (http://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/DocumentBuilderFactory.html#setNamespaceAware%28boolean%29) on your DocumentBuilderFactory. Then, to apply an XPath expression with namespaces, you need to implement a NamespaceContext and set it on the XPath object. I think Mark has already linked to a page showing that in his comment.
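A minimal sketch of those two steps follows. It is not the only way to do it, and several things in it are assumptions: the class names and the "m" prefix are made up for illustration, the zs namespace URI is assumed to be http://www.loc.gov/zing/srw/ (check the xmlns:zs declaration in the real response), and the mods element is assumed to declare http://www.loc.gov/mods/v3 as its default namespace. An inline string stands in for the service response. Note also that, going by the XML in the question, the path needs a zs:record step between zs:records and zs:recordData, and the result of XPathConstants.NODE is a Node, not a Document.

```java
import java.io.StringReader;
import java.util.Collections;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathConstants;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class NamespaceXPathSketch {
  // Assumed namespace URIs; verify them against the xmlns declarations in the real response.
  static final String ZS_NS = "http://www.loc.gov/zing/srw/";
  static final String MODS_NS = "http://www.loc.gov/mods/v3";

  // Maps the prefixes used in the XPath expression to namespace URIs.
  // The prefixes only need to match the expression, not the input document.
  static final class SruNamespaces implements NamespaceContext {
    @Override
    public String getNamespaceURI(String prefix) {
      if ("zs".equals(prefix)) return ZS_NS;
      if ("m".equals(prefix)) return MODS_NS;
      return XMLConstants.NULL_NS_URI;
    }

    @Override
    public String getPrefix(String namespaceURI) {
      if (ZS_NS.equals(namespaceURI)) return "zs";
      if (MODS_NS.equals(namespaceURI)) return "m";
      return null;
    }

    @Override
    public Iterator<String> getPrefixes(String namespaceURI) {
      String prefix = getPrefix(namespaceURI);
      return prefix == null
          ? Collections.<String>emptyIterator()
          : Collections.singletonList(prefix).iterator();
    }
  }

  // Parses the XML with namespace awareness and selects the mods element.
  static Node selectMods(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true); // off by default
    Document doc = factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));

    XPath xpath = XPathFactory.newInstance().newXPath();
    xpath.setNamespaceContext(new SruNamespaces());
    return (Node) xpath.evaluate(
        "/zs:searchRetrieveResponse/zs:records/zs:record/zs:recordData/m:mods",
        doc, XPathConstants.NODE);
  }

  public static void main(String[] args) throws Exception {
    // Stand-in for the service response, with the namespaces declared explicitly.
    String xml =
        "<zs:searchRetrieveResponse xmlns:zs='http://www.loc.gov/zing/srw/'>"
        + "<zs:records><zs:record><zs:recordData>"
        + "<mods xmlns='http://www.loc.gov/mods/v3' version='3.2'>"
        + "<titleInfo><title>Example</title></titleInfo></mods>"
        + "</zs:recordData></zs:record></zs:records>"
        + "</zs:searchRetrieveResponse>";
    Node mods = selectMods(xml);
    System.out.println(mods == null ? "no match" : mods.getLocalName());
  }
}
```

Because the mods element sits in a default namespace, an unprefixed step such as mods would not match it in XPath 1.0; that is why the sketch binds it to a prefix of its own.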

As for selecting a new document with XPath: no, that is not what XPath does at all. It lets you select nodes in an existing document and navigate around it. If you select a particular node down in the hierarchy, you get that node, but it is still in the document, with all its children and descendants as well as its ancestors and siblings.

Thus, if you want a new, standalone document, you need to create one with DocumentBuilder.newDocument() (http://docs.oracle.com/javase/7/docs/api/javax/xml/parsers/DocumentBuilder.html#newDocument%28%29), then call importNode or adoptNode on that new document with the node you selected via XPath in your input document, and finally appendChild the result.
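A rough sketch of that copy step, under some assumptions: the class and method names are hypothetical, a small inline document stands in for the service response, and the node is taken with getFirstChild() rather than XPath to keep the example short. The serialize helper uses the standard javax.xml.transform API to turn the new document into a string with its tags intact, which is one possible way to hand it on to other code.

```java
import java.io.StringReader;
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class StandaloneDocumentSketch {
  // Namespace-aware parse of an XML string (helper for the example).
  static Document parse(String xml) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    return factory.newDocumentBuilder()
        .parse(new InputSource(new StringReader(xml)));
  }

  // Copies the selected node (and its subtree) into a fresh, empty document.
  static Document toStandaloneDocument(Node selected) throws Exception {
    DocumentBuilderFactory factory = DocumentBuilderFactory.newInstance();
    factory.setNamespaceAware(true);
    Document target = factory.newDocumentBuilder().newDocument();
    // importNode makes a deep copy owned by target; the source document is unchanged.
    Node copy = target.importNode(selected, true);
    target.appendChild(copy);
    return target;
  }

  // Serializes a document to a string, tags included.
  static String serialize(Document doc) throws Exception {
    Transformer transformer = TransformerFactory.newInstance().newTransformer();
    transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
    StringWriter out = new StringWriter();
    transformer.transform(new DOMSource(doc), new StreamResult(out));
    return out.toString();
  }

  public static void main(String[] args) throws Exception {
    Document source = parse(
        "<wrapper><mods xmlns='http://www.loc.gov/mods/v3'>"
        + "<title>Example</title></mods></wrapper>");
    Node mods = source.getDocumentElement().getFirstChild();

    Document standalone = toStandaloneDocument(mods);
    // The original node still belongs to the source document.
    System.out.println(mods.getOwnerDocument() == source);
    System.out.println(serialize(standalone));
  }
}
```

importNode leaves the input document untouched, whereas adoptNode moves the node out of it; which one fits depends on whether the original document is still needed afterwards.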

Martin Honnen