I have an RDF/XML document with this format:

<?xml version="1.0" encoding="UTF-8"?>
<rdf:RDF xmlns:ags="http://purl.org/agmes/1.1/" xmlns:foaf="http://xmlns.com/foaf/0.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#" xmlns:bibo="http://purl.org/ontology/bibo/" xmlns:dct="http://purl.org/dc/terms/">
  <bibo:Article rdf:about="http://xxxxx/NO8500391">
    <dct:identifier>NO8500391</dct:identifier>
    ...
  </bibo:Article>
  <bibo:Article rdf:about="http://xxxxx/NO8500523">
    ...
  </bibo:Article>
  <bibo:Article rdf:about="http://xxxxx/NO8500496">
  ...
  </bibo:Article>
</rdf:RDF>

As you can see, a single RDF/XML file contains many bibo:Article resources, possibly thousands. I want to extract each article and convert it to RDF/JSON using Apache Jena (I know how to write a model), so that I have a separate document for each article and can later import them all into an index such as CouchDB or Elasticsearch to perform searches.

How can I extract each article from the Jena model? The dirty way I was thinking of is to process the file as XML and extract each bibo:Article element.

Joshua Taylor
po5i
  • As I've said in another answer, [don't process RDF/XML as XML](http://stackoverflow.com/a/17052385/1281433)! The same RDF graph has lots of distinct RDF/XML serializations, and some of them won't have _any_ `bibo:Article` XML elements. – Joshua Taylor Jun 20 '13 at 18:54

1 Answer

First, can I ask for some clarification on your question? I think what you are asking is how to split each bibo:Article entry into its own document, right?

As an aside, note that this is not the same as splitting on each first-level node, because RDF/XML is not a canonical serialization: the same RDF graph may be serialized as many different RDF/XML documents, and there is no guarantee that the articles will always appear as first-level nodes.

Now, to try to answer your question: there are two main ways to achieve your aim.

Using SPARQL Queries

First issue a SELECT query to retrieve all article instances, then for each result issue a DESCRIBE query on the article URI, which will give you a new Jena Model containing only information about that URI.

Note that you can change exactly how DESCRIBE queries are answered by creating a custom DescribeHandler if you wish, but that may be overkill.

You can then serialize the results of each DESCRIBE query to a new document.
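The SELECT-then-DESCRIBE steps above can be sketched roughly as follows. This is a minimal sketch, not a definitive implementation: it assumes Jena 3 or later (packages under `org.apache.jena`; the 2.x releases used `com.hp.hpl.jena` instead), and the class name `ArticleSplitterSparql` and its methods are hypothetical names chosen for illustration. With Jena's default describe handler, each result should contain the statements whose subject is the article plus any blank nodes hanging off it, so a blank-node foaf:Person under dct:creator would be included, whereas a creator identified by a URI would not be followed.

```java
import org.apache.jena.query.QueryExecution;
import org.apache.jena.query.QueryExecutionFactory;
import org.apache.jena.query.ResultSet;
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.riot.Lang;
import org.apache.jena.riot.RDFDataMgr;

import java.io.OutputStream;
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical helper class; the name is not part of Jena.
public class ArticleSplitterSparql {
    static final String BIBO = "http://purl.org/ontology/bibo/";

    /** Returns one model per bibo:Article, keyed by the article URI. */
    static Map<String, Model> split(Model source) {
        // Step 1: collect the URIs of all resources typed as bibo:Article.
        List<String> uris = new ArrayList<>();
        String select = "PREFIX bibo: <" + BIBO + "> "
                      + "SELECT ?article WHERE { ?article a bibo:Article }";
        try (QueryExecution qe = QueryExecutionFactory.create(select, source)) {
            ResultSet results = qe.execSelect();
            while (results.hasNext()) {
                Resource article = results.next().getResource("article");
                // Assumption: articles are named by URI, as in the question;
                // blank-node articles are skipped because DESCRIBE needs a URI.
                if (article.isURIResource()) {
                    uris.add(article.getURI());
                }
            }
        }

        // Step 2: DESCRIBE each article into its own model.
        Map<String, Model> parts = new LinkedHashMap<>();
        for (String uri : uris) {
            try (QueryExecution qe =
                     QueryExecutionFactory.create("DESCRIBE <" + uri + ">", source)) {
                parts.put(uri, qe.execDescribe());
            }
        }
        return parts;
    }

    /** Serializes one article model as RDF/JSON. */
    static void writeJson(Model article, OutputStream out) {
        RDFDataMgr.write(out, article, Lang.RDFJSON);
    }
}
```

A caller would load the RDF/XML file into a model, call `split(model)`, and pass each resulting model to `writeJson` with an output stream of its choosing, one per article.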

Using the Model API

Use the listStatements() method (the overload that takes search criteria) to first find the articles; then, as in the first solution, issue further calls for each discovered article URI to find the statements about it. These can be added to a new model and serialized out as desired.
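A rough sketch of the Model API route, under the same assumptions as before (Jena 3 or later with `org.apache.jena` packages; `ArticleSplitterModelApi` and its methods are hypothetical names). A `null` argument to `listStatements` acts as a wildcard, and the casts avoid the overload ambiguity that bare `null` arguments can trigger. Following blank-node objects recursively is a design choice added here so that nested descriptions, such as a blank-node foaf:Person under dct:creator, travel with their article; it is not something the API does on its own.

```java
import org.apache.jena.rdf.model.Model;
import org.apache.jena.rdf.model.ModelFactory;
import org.apache.jena.rdf.model.Property;
import org.apache.jena.rdf.model.RDFNode;
import org.apache.jena.rdf.model.Resource;
import org.apache.jena.rdf.model.Statement;
import org.apache.jena.vocabulary.RDF;

import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical helper class; the name is not part of Jena.
public class ArticleSplitterModelApi {
    static final String BIBO = "http://purl.org/ontology/bibo/";

    /** Returns one model per bibo:Article, keyed by the article URI. */
    static Map<String, Model> split(Model source) {
        Resource articleType = source.createResource(BIBO + "Article");
        Map<String, Model> parts = new LinkedHashMap<>();

        // Find every statement of the form "?s rdf:type bibo:Article".
        for (Statement typing :
                source.listStatements((Resource) null, RDF.type, articleType).toList()) {
            Resource article = typing.getSubject();
            // Assumption: articles are named by URI, as in the question.
            if (!article.isURIResource()) {
                continue;
            }
            Model target = ModelFactory.createDefaultModel();
            copyDescription(source, article, target, new HashSet<>());
            parts.put(article.getURI(), target);
        }
        return parts;
    }

    /** Copies all statements about subject, following blank-node objects. */
    static void copyDescription(Model source, Resource subject,
                                Model target, Set<Resource> seen) {
        if (!seen.add(subject)) {
            return; // already copied; guards against blank-node cycles
        }
        // null is a wildcard; the casts pick the (Resource, Property, RDFNode) overload.
        for (Statement s :
                source.listStatements(subject, (Property) null, (RDFNode) null).toList()) {
            target.add(s);
            if (s.getObject().isAnon()) {
                copyDescription(source, s.getObject().asResource(), target, seen);
            }
        }
    }
}
```

Each resulting model can then be written out with `model.write(out, "RDF/JSON")` or an equivalent writer call.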

RobV
  • I understand that RDF may be serialized differently; I was wrong to write about first-level nodes. – po5i Jun 20 '13 at 20:00
  • How do I discover all the statements under an Article if I have more than 10 statements (with different namespaces)? Do I have to query the model for each different property that I want? In my case, under the bibo:Article I have dct:identifier, dct:date, dct:source, bibo:language, dct:creator (and under it foaf:Person and foaf:name), and more. – po5i Jun 20 '13 at 20:16
  • `listStatements(subj, null, null)` - the use of `null` acts as a wildcard (the Javadoc tells you this) - http://jena.apache.org/documentation/javadoc/jena/com/hp/hpl/jena/rdf/model/Model.html#listStatements(com.hp.hpl.jena.rdf.model.Resource,%20com.hp.hpl.jena.rdf.model.Property,%20com.hp.hpl.jena.rdf.model.RDFNode) – RobV Jun 20 '13 at 23:18
  • I have problems using `subj, null, null`: "The method listStatements(Resource, Property, RDFNode) is ambiguous for the type Model". I think this is because there are two definitions of listStatements with exactly three arguments each, one in Model and another in ModelCon. Even if I use a Selector, the ambiguity problem arises again. – po5i Jun 24 '13 at 22:35
  • @po5i This is no different from any other Java API: just cast one or more of the nulls to the relevant types to resolve the ambiguity, e.g. `listStatements(subj, (Property)null, (RDFNode)null)` – RobV Jun 25 '13 at 05:26