I have some RDF/XML data that I'd like to parse so that I can access particular nodes. It looks like this:

<!-- http://purl.obolibrary.org/obo/VO_0000185 -->

    <owl:Class rdf:about="&obo;VO_0000185">
        <rdfs:label>Influenza virus gene</rdfs:label>
        <rdfs:subClassOf rdf:resource="&obo;VO_0000156"/>
        <obo:IAO_0000117>YH</obo:IAO_0000117>
    </owl:Class>



    <!-- http://purl.obolibrary.org/obo/VO_0000186 -->

    <owl:Class rdf:about="&obo;VO_0000186">
        <rdfs:label>RNA vaccine</rdfs:label>
        <owl:equivalentClass>
            <owl:Class>
                <owl:intersectionOf rdf:parseType="Collection">
                    <rdf:Description rdf:about="&obo;VO_0000001"/>
                    <owl:Restriction>
                        <owl:onProperty rdf:resource="&obo;BFO_0000161"/>
                        <owl:someValuesFrom rdf:resource="&obo;VO_0000728"/>
                    </owl:Restriction>
                </owl:intersectionOf>
            </owl:Class>
        </owl:equivalentClass>
        <rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
        <obo:IAO_0000116>Using RNA may eliminate the problem of having to tailor a vaccine for each individual patient with their specific immunity. The advantage of RNA is that it can be used for all immunity types and can be taken from a single cell. DNA vaccines need to produce RNA which then prompts the manufacture of proteins. However, RNA vaccine eliminates the step from DNA to RNA.</obo:IAO_0000116>
        <obo:IAO_0000115>A vaccine that uses RNA(s) derived from a pathogen organism.</obo:IAO_0000115>
        <obo:IAO_0000117>YH</obo:IAO_0000117>
    </owl:Class>

The complete RDF/XML file can be found here.

What I want to do is the following:

  1. Find each chunk that contains the entry <rdfs:subClassOf rdf:resource="&obo;VO_0000001"/>
  2. Access the literal term defined by <rdfs:label>...</rdfs:label>

So in the above example the code would go through the second chunk and output: "RNA vaccine".

I'm currently stuck with the following code, which fails to access the nodes. What's the right way to do it? Solutions other than XML::LibXML are welcome.

#!/usr/bin/perl -w
use strict;
use Data::Dumper;
use Carp;
use File::Basename;
use XML::LibXML 1.70;

my $filename = "VO.owl";
# Obtained from http://svn.code.sf.net/p/vaccineontology/code/trunk/src/ontology/VO.owl

my $parser = XML::LibXML->new();
my $doc = $parser->parse_file( $filename );

foreach my $chunk ($doc->findnodes('/owl:Class')) {
        my ($label) = $chunk->findnodes('./rdfs:label');
        my ($subclass) = $chunk->findnodes('./rdfs:subClassOf');
        print $label->to_literal;
        print $subclass->to_literal;

}
neversaint
  • I'd mention that solutions not using XML libraries should be not only welcomed but _preferred_; [don't try to parse RDF as XML](http://stackoverflow.com/a/17052385/1281433). It's true that RDF can be serialized in XML, but the same RDF graph can be serialized in XML in _many different ways_, and an XML solution that works on one is rather unlikely to work on another. RDF is a _graph-based_ representation and should be treated as such. – Joshua Taylor Jul 18 '13 at 12:16

2 Answers

Parsing RDF as if it were XML is a folly. The exact same data can appear in many different ways. For example, all of the following RDF files carry the same data. Any conforming RDF implementation MUST handle them identically...

<!-- example 1 -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="#me">
    <rdf:type rdf:resource="http://xmlns.com/foaf/0.1/Person" />
    <foaf:name>Toby Inkster</foaf:name>
  </rdf:Description>
</rdf:RDF>

<!-- example 2 -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#me">
    <foaf:name>Toby Inkster</foaf:name>
  </foaf:Person>
</rdf:RDF>

<!-- example 3 -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#me" foaf:name="Toby Inkster" />
</rdf:RDF>

<!-- example 4 -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:about="#me"
    rdf:type="http://xmlns.com/foaf/0.1/Person"
    foaf:name="Toby Inkster" />
</rdf:RDF>

<!-- example 5 -->
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <rdf:Description rdf:ID="me">
    <rdf:type>
      <rdf:Description rdf:about="http://xmlns.com/foaf/0.1/Person" />
    </rdf:type>
    <foaf:name>Toby Inkster</foaf:name>
  </rdf:Description>
</rdf:RDF>

<!-- example 6 -->
<foaf:Person
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/"
    rdf:about="#me"
    foaf:name="Toby Inkster" />

I could easily list half a dozen other variations too, but I'll stop there. And this RDF file contains just two statements (I'm a Person; my name is "Toby Inkster"), whereas the OP's data contains over 50,000 statements.
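As a rough check of that claim, the sketch below parses examples 2 and 3 with RDF::Trine and compares the triples each one yields. The base URI http://example.org/doc and the helper name triples_of are assumptions made for illustration, not part of the original data.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use RDF::Trine;

# Example 2: typed node element with a child property element.
my $example2 = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#me">
    <foaf:name>Toby Inkster</foaf:name>
  </foaf:Person>
</rdf:RDF>
XML

# Example 3: the same data, with the name as a property attribute.
my $example3 = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:foaf="http://xmlns.com/foaf/0.1/">
  <foaf:Person rdf:about="#me" foaf:name="Toby Inkster" />
</rdf:RDF>
XML

# Hypothetical helper: parse RDF/XML into an in-memory model and
# return its triples as a sorted list of strings.
sub triples_of {
    my ($xml) = @_;
    my $model = RDF::Trine::Model->temporary_model;
    # The base URI is an assumption; it is needed to resolve "#me".
    RDF::Trine::Parser->new('rdfxml')
        ->parse_into_model('http://example.org/doc', $xml, $model);
    my @triples;
    my $iter = $model->get_statements(undef, undef, undef);
    while (my $st = $iter->next) {
        push @triples, $st->as_string;
    }
    my @sorted = sort @triples;
    return @sorted;
}

my @a = triples_of($example2);
my @b = triples_of($example3);
print join("\n", @a), "\n";
print "@a" eq "@b" ? "same triples\n" : "different triples\n";
```

An XPath written against example 2 (looking for a foaf:name child element) would find nothing in example 3, even though the graph is the same.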

And this is just the XML serialization of RDF; there are other serializations too.

If you try handling all that with XPath, you're likely to end up becoming a lunatic locked away in a tower somewhere, muttering in his sleep about the triples; the triples...

Luckily, Greg Williams has taken that mental health bullet for you. RDF::Trine and RDF::Query are not only the best RDF frameworks for Perl; they're amongst the best in any programming language.

Here is how the OP's task could be achieved using RDF::Trine and RDF::Query:

#!/usr/bin/env perl

use v5.12;
use RDF::Trine;
use RDF::Query;

my $model = 'RDF::Trine::Model'->new(
    'RDF::Trine::Store::DBI'->new(
        'vo',
        'dbi:SQLite:dbname=/tmp/vo.sqlite',
        '',  # no username
        '',  # no password
    ),
);

'RDF::Trine::Parser::RDFXML'->new->parse_url_into_model(
    'http://svn.code.sf.net/p/vaccineontology/code/trunk/src/ontology/VO.owl',
    $model,
) unless $model->size > 0;

my $query = RDF::Query->new(<<'SPARQL');
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?super_label ?sub_label
WHERE {
    ?sub rdfs:subClassOf ?super .
    ?sub rdfs:label ?sub_label .
    ?super rdfs:label ?super_label .
}
LIMIT 5
SPARQL

print $query->execute($model)->as_string;

Sample output:

+----------------------------+----------------------------------+
| super_label                | sub_label                        |
+----------------------------+----------------------------------+
| "Aves vaccine"             | "Ducks vaccine"                  |
| "route of administration"  | "intravaginal route"             |
| "Shigella gene"            | "aroA from Shigella"             |
| "Papillomavirus vaccine"   | "Bovine papillomavirus vaccine"  |
| "virus protein"            | "Feline leukemia virus protein"  |
+----------------------------+----------------------------------+

UPDATE: Here's a SPARQL query that can be plugged into the script above to retrieve the data you wanted:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:  <http://purl.obolibrary.org/obo/>
SELECT ?subclass ?label
WHERE {
    ?subclass
        rdfs:subClassOf obo:VO_0000001 ;
        rdfs:label ?label .
}
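To show how the rows might be consumed, here is a minimal sketch that runs this query and prints only the labels. It is an assumption-laden variant of the script above: instead of the SQLite store and the full VO.owl download, it loads a trimmed two-class sample (with the &obo; entity expanded by hand) into an in-memory model, and the base URI http://example.org/ is invented for illustration.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use RDF::Trine;
use RDF::Query;

# Trimmed stand-in for VO.owl (an assumption; the real file is far larger).
my $rdfxml = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000185">
    <rdfs:label>Influenza virus gene</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000156"/>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000186">
    <rdfs:label>RNA vaccine</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000001"/>
  </owl:Class>
</rdf:RDF>
XML

# In-memory model instead of the SQLite-backed store used above.
my $model = RDF::Trine::Model->temporary_model;
RDF::Trine::Parser->new('rdfxml')
    ->parse_into_model('http://example.org/', $rdfxml, $model);

my $query = RDF::Query->new(<<'SPARQL');
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX obo:  <http://purl.obolibrary.org/obo/>
SELECT ?subclass ?label
WHERE {
    ?subclass
        rdfs:subClassOf obo:VO_0000001 ;
        rdfs:label ?label .
}
SPARQL

# Each row maps variable names to RDF::Trine nodes; ?label is a literal.
my @labels;
my $results = $query->execute($model);
while (my $row = $results->next) {
    push @labels, $row->{label}->literal_value;
}
print "$_\n" for @labels;
```

Because the triple pattern fixes the object of rdfs:subClassOf to obo:VO_0000001, every label returned belongs to a direct subclass of that class; no extra filtering is needed in Perl.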
tobyink
  • Thanks for the explanation. I usually try to stay away from XML, and there are some XML technologies which I especially try to avoid (e.g. XSD, SOAP) --- RDF will be added to this list :-) – Slaven Rezic Jul 19 '13 at 14:42
  • You should certainly not add RDF to the list of XML technologies you wish to avoid. Avoid it if you like; fine. But (despite having an XML serialization) it's not an XML technology, so you would have put it on the wrong list. – tobyink Jul 19 '13 at 15:12
  • @tobyink: Thanks. But how can I ensure that each label in the output belongs to a subclass of VO_0000001? – neversaint Jul 20 '13 at 22:48

/owl:Class is not the root element in your XML document. You have to include the root element in your XPath: /rdf:RDF/owl:Class. Or, if you want to get all occurrences regardless of their depth in the XML tree, you can use the double-slash notation: //owl:Class.
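A minimal sketch of the corrected lookup follows, assuming a trimmed inline sample in place of VO.owl (with the &obo; entity expanded by hand) and explicitly registered namespace prefixes. It only handles this particular serialization of the data. Note that the superclass is stored in the rdf:resource attribute of rdfs:subClassOf, so it is read with getAttributeNS rather than to_literal.

```perl
#!/usr/bin/env perl
use strict;
use warnings;
use XML::LibXML;
use XML::LibXML::XPathContext;

my $RDF_NS = 'http://www.w3.org/1999/02/22-rdf-syntax-ns#';

# Trimmed stand-in for VO.owl (an assumption made for illustration).
my $xml = <<'XML';
<rdf:RDF
    xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"
    xmlns:rdfs="http://www.w3.org/2000/01/rdf-schema#"
    xmlns:owl="http://www.w3.org/2002/07/owl#">
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000185">
    <rdfs:label>Influenza virus gene</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000156"/>
  </owl:Class>
  <owl:Class rdf:about="http://purl.obolibrary.org/obo/VO_0000186">
    <rdfs:label>RNA vaccine</rdfs:label>
    <rdfs:subClassOf rdf:resource="http://purl.obolibrary.org/obo/VO_0000001"/>
  </owl:Class>
</rdf:RDF>
XML

my $doc = XML::LibXML->load_xml(string => $xml);

# Register the prefixes used in the XPath expressions below.
my $xpc = XML::LibXML::XPathContext->new($doc);
$xpc->registerNs(rdf  => $RDF_NS);
$xpc->registerNs(rdfs => 'http://www.w3.org/2000/01/rdf-schema#');
$xpc->registerNs(owl  => 'http://www.w3.org/2002/07/owl#');

my $target = 'http://purl.obolibrary.org/obo/VO_0000001';
my @labels;
for my $class ($xpc->findnodes('/rdf:RDF/owl:Class')) {
    for my $sub ($xpc->findnodes('./rdfs:subClassOf', $class)) {
        # The superclass is an attribute value, not text content.
        my $super = $sub->getAttributeNS($RDF_NS, 'resource') // '';
        next unless $super eq $target;
        push @labels, $xpc->findvalue('./rdfs:label', $class);
    }
}
print "$_\n" for @labels;
```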

Slaven Rezic
  • Thanks Slaven. But I can't access the 'subClassOf' content. What's the right command for that? – neversaint Jul 18 '13 at 06:05
  • You can access subClassOf. But it has no literal value (that's the text content between tags), so it appears as an empty string. Instead of `to_literal()`, try `serialize` to see that it matches. – Slaven Rezic Jul 18 '13 at 06:19
  • @neversaint What subclass content do you mean? In the question you said you were trying to access the value of the `rdfs:label` property of the classes, and then also to identify the values of the `rdfs:subClassOf` property of the classes. What content are you trying to get from the subclasses? – Joshua Taylor Jul 18 '13 at 12:31
  • Really, really, forget XPath for parsing RDF. Use [RDF::Trine](https://metacpan.org/release/RDF-Trine) and possibly [RDF::Query](https://metacpan.org/release/RDF-Query). – tobyink Jul 19 '13 at 00:27
  • @tobyink: why? Learning another two modules for simple things seems like overkill to me, especially if one knows how to deal with XPath. I think you should provide some code for the above example to show that RDF::Trine/Query make things easier here... – Slaven Rezic Jul 19 '13 at 06:06
  • The explanation is longer than will fit in a comment, but I am happy to provide a full answer. It's coming... – tobyink Jul 19 '13 at 13:19