
I'm attempting to use JDOM2 to extract the information I care about from an XML document. How do I get a tag within a tag?

I have been only partially successful. While I have been able to use XPath to extract <record> tags, the XPath query to extract the title, description and other data within the record tags has been returning null.

To extract the <record> tags I use the following XPath query: "//oai:record", where "oai" is a namespace prefix I made up in order to use XPath.

You can see the XML document I'm parsing here, and I've put a sample below: http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&set=cwp&metadataPrefix=oai_dc

<record>
    <header>
        <identifier>oai:lcoa1.loc.gov:loc.pnp/cph.3a02293</identifier>
        <datestamp>2009-05-27T07:22:37Z</datestamp>
        <setSpec>cwp</setSpec>
        <setSpec>lcphotos</setSpec>
    </header>
    <metadata>
        <oai_dc:dc xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/                          http://www.openarchives.org/OAI/2.0/oai_dc.xsd">
            <dc:title>Jubal A. Early</dc:title>
            <dc:description>This record contains unverified, old data from caption card.</dc:description>
            <dc:date>[between 1860 and 1880]</dc:date>
            <dc:type>image</dc:type>
            <dc:type>still image</dc:type>
            <dc:identifier>http://hdl.loc.gov/loc.pnp/cph.3a02293</dc:identifier>
            <dc:language>eng</dc:language>
            <dc:rights>No known restrictions on publication.</dc:rights>
        </oai_dc:dc>
    </metadata>
</record>

If you look in the larger document you will see that there is never an "xmlns" attribute listed on any of the tags. There is also the matter of there being three different namespaces in the document ("none/oai", "oai_dc", "dc").

What is happening is that the XPath matches nothing, and evaluateFirst(parent) returns null.

Here is some of my code to extract the title, date, description etc. out of the record element.

    XPathFactory xpf = XPathFactory.instance();
    XPathExpression<Element> xpath = xpf.compile("//dc:title",
                  Filters.element(), null,
                  namespaceList.toArray(new Namespace[namespaceList.size()]));
    Element tag = xpath.evaluateFirst(parent);

    if(tag != null)
    {
        return Option.fromString(tag.getText());
    }

    return Option.none();
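One thing worth checking in the snippet above, separate from the namespace mapping: in standard XPath, an expression that begins with `//` is evaluated from the root of the document, even when a context node such as `parent` is passed in. A relative expression such as `.//dc:title` stays inside the context node. The following is only a minimal sketch of that difference. It uses the JDK's built-in `javax.xml.xpath` in place of JDOM2 so that it runs without extra dependencies, and the `RelativeXPathDemo` class, its `titleOfSecondRecord` helper, and the two-record sample document are all hypothetical.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;

public class RelativeXPathDemo {
    static final String DC = "http://purl.org/dc/elements/1.1/";

    // Hypothetical, simplified stand-in for the OAI response: two records, no whitespace.
    static final String XML =
        "<records xmlns:dc=\"" + DC + "\">"
      + "<record><dc:title>First</dc:title></record>"
      + "<record><dc:title>Second</dc:title></record>"
      + "</records>";

    // Evaluates the expression with the SECOND <record> element as the context node.
    static String titleOfSecondRecord(String expression) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for prefixed XPath steps to match
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XML)));
            Node second = doc.getDocumentElement().getChildNodes().item(1);

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String prefix) {
                    return "dc".equals(prefix) ? DC : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String uri) { return null; }
                public Iterator<String> getPrefixes(String uri) { return null; }
            });
            // Returns the string value of the first matching node in document order.
            return xpath.evaluate(expression, second);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(titleOfSecondRecord("//dc:title"));  // searches the whole document
        System.out.println(titleOfSecondRecord(".//dc:title")); // searches only the context record
    }
}
```

If the same title comes back for every record once the namespace is fixed, switching the query to `.//dc:title` is the first thing to try.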

Any thoughts would be appreciated! Thanks.

Prichmp
  • Is there a question in here somewhere? I don't understand what you are asking. – jtahlborn Dec 14 '15 at 01:30
  • How do I extract the contents of `dc:title` out of `record`? – Prichmp Dec 14 '15 at 01:53
  • I don't know about jdom tho, but assuming that you have mapped `dc` to the correct namespace uri `http://purl.org/dc/elements/1.1/`, I think the XPath should work – har07 Dec 14 '15 at 02:23
  • @har07 You were right. What had happened is that I had mapped the dc namespace to http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements. (Which is where http://purl.org/dc/elements/1.1/ redirects to. I feel a little silly.) Once I changed it, it started working. This raises more questions than it answers. How did the XML parser know which namespace was right? I don't think i made a HTTP request, and purl.org never appears in the original XML. Anyway, if you add your comment as an answer I'll mark it as right. – Prichmp Dec 14 '15 at 03:44
  • @Gamebear Done posting an answer. It also briefly answers the question in your last comment above – har07 Dec 14 '15 at 04:07
  • Missed these last comments before posting, but note that the dc namespace *is* in the XML, it's in the copy I got off the net, anyway. – rolfl Dec 14 '15 at 04:27
  • I see the dc namespace, but I could not find the http://purl.org/dc/elements/1.1/ uri in the XML off the web anywhere. What am I missing? – Prichmp Dec 14 '15 at 04:37
  • @Gamebear What I did was open [the link](http://memory.loc.gov/cgi-bin/oai2_0?verb=ListRecords&set=cwp&metadataPrefix=oai_dc) in browser (I'm using mozilla firefox). `->` right-click `->` 'view source' – har07 Dec 14 '15 at 06:02

1 Answer


In your XML, the dc prefix is mapped to the namespace URI http://purl.org/dc/elements/1.1/, so make sure the prefix mapping you declare for the XPath uses that same URI. This is the part of your XML where the namespace prefixes are declared:

<oai_dc:dc
    xmlns:oai_dc="http://www.openarchives.org/OAI/2.0/oai_dc/"
    xmlns:dc="http://purl.org/dc/elements/1.1/"
    xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
    xsi:schemaLocation="http://www.openarchives.org/OAI/2.0/oai_dc/
                         http://www.openarchives.org/OAI/2.0/oai_dc.xsd">

An XML parser only sees the namespaces explicitly declared in the XML. It won't try to open the namespace URL, since a namespace name is not necessarily a URL. For example, the following URI, which I found in this recent SO question, is also acceptable as a namespace: uuid:ebfd9-45-48-a9eb-42d
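To illustrate the point, here is a minimal sketch showing that a namespace URI is compared as a plain string: the prefix used in the XPath can be anything, but the URI has to match the one declared in the XML character for character. It uses the JDK's built-in `javax.xml.xpath` rather than JDOM2 so that it runs without extra dependencies; the `NamespaceUriDemo` class and its `firstTitle` helper are hypothetical names. In JDOM2, the corresponding mapping would be created with `Namespace.getNamespace("dc", "http://purl.org/dc/elements/1.1/")` and passed to `XPathFactory.compile`.

```java
import java.io.StringReader;
import java.util.Iterator;
import javax.xml.XMLConstants;
import javax.xml.namespace.NamespaceContext;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class NamespaceUriDemo {
    // Trimmed-down version of the <oai_dc:dc> element from the question.
    static final String XML =
        "<oai_dc:dc xmlns:oai_dc=\"http://www.openarchives.org/OAI/2.0/oai_dc/\""
      + " xmlns:dc=\"http://purl.org/dc/elements/1.1/\">"
      + "<dc:title>Jubal A. Early</dc:title>"
      + "</oai_dc:dc>";

    // Binds `prefix` to `uri` for the XPath only, then looks up <prefix>:title.
    // Returns the empty string when nothing matches.
    static String firstTitle(String prefix, String uri) {
        try {
            DocumentBuilderFactory dbf = DocumentBuilderFactory.newInstance();
            dbf.setNamespaceAware(true); // required for namespace matching
            Document doc = dbf.newDocumentBuilder()
                              .parse(new InputSource(new StringReader(XML)));

            XPath xpath = XPathFactory.newInstance().newXPath();
            xpath.setNamespaceContext(new NamespaceContext() {
                public String getNamespaceURI(String p) {
                    return prefix.equals(p) ? uri : XMLConstants.NULL_NS_URI;
                }
                public String getPrefix(String u) { return null; }
                public Iterator<String> getPrefixes(String u) { return null; }
            });
            return xpath.evaluate("//" + prefix + ":title", doc);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // Same URI as in the XML: matches.
        System.out.println(firstTitle("dc", "http://purl.org/dc/elements/1.1/"));
        // Different prefix, same URI: still matches, the prefix is only a local alias.
        System.out.println(firstTitle("foo", "http://purl.org/dc/elements/1.1/"));
        // The page the purl.org URL redirects to is a different string: no match.
        System.out.println(firstTitle("dc",
            "http://dublincore.org/documents/2012/06/14/dcmi-terms/?v=elements").isEmpty());
    }
}
```

No HTTP request is made at any point; the URI in the XPath mapping is simply compared with the URI bound to the element's prefix in the document.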

har07