4

I use Python with lxml to process XML. After querying/filtering to get to the node I want, I have a problem: how do I get its attribute's value with XPath? Here is an example of my input.

>print(etree.tostring(node, pretty_print=True ))
<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"  rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>

The value I want is in resource=... . Currently I just use lxml to get the value. I wonder whether it is possible to do this in pure XPath? Thanks.

EDIT: I forgot to say that this is not a root node, so I can't use // here. I have around 2000-3000 others like it in the XML file. My first attempts were playing around with ".@attrib" and "self::*@", but those do not seem to work.

EDIT2: I will try my best to explain (this is my first time dealing with an XML problem using XPath, and English is not my strongest subject). Here is my input snippet: http://pastebin.com/kZmVdbQQ (the full file is from http://www.comp-sys-bio.org/yeastnet/, version 4).

In my code, I try to get the speciesType nodes that have a ChEBI resource link (<rdf:li rdf:resource="urn:miriam:obo.chebi:...."/>), and then I try to get the value of the rdf:resource attribute in rdf:li. I am pretty sure it would be easy to get the attribute of a child node if I started from a parent node like speciesType, but I wonder how to do it if I start from rdf:li. From my understanding, "//" in XPath looks for nodes everywhere, not just under the current node.

Below is my code:

import lxml.etree as etree

tree = etree.parse("yeast_4.02.xml")
root = tree.getroot()
ns = {"sbml": "http://www.sbml.org/sbml/level2/version4", 
      "rdf":"http://www.w3.org/1999/02/22-rdf-syntax-ns#",
      "body":"http://www.w3.org/1999/xhtml",
      "re": "http://exslt.org/regular-expressions"
      }
#good enough for now
maybemeta = root.xpath("//sbml:speciesType[descendant::rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]]", namespaces = ns)

def extract_name_and_chebi(node):
    name = node.attrib['name']
    chebies = node.xpath("./sbml:annotation//rdf:li[starts-with(@rdf:resource, 'urn:miriam:obo.chebi') and not(starts-with(@rdf:resource, 'urn:miriam:uniprot'))]", namespaces=ns) #get all rdf:li node with chebi resource
    assert len(chebies) == 1
    #my current solution to get rdf:resource value from rdf:li node
    rdfNS = "{" + ns.get('rdf') + "}"
    chebi = chebies[0].attrib[rdfNS + 'resource'] 
    #do protein later
    return (name, chebi)

metaWithChebi = map(extract_name_and_chebi, maybemeta)

# the with-block closes the output file even if a write fails
with open("metabolites.txt", "w") as fo:
    for name, chebi in metaWithChebi:
        fo.write("{0}\t{1}\n".format(name, chebi))
mhucka
Tg.
  • Parsing rdf xml with xpath is really not a great idea. XML is a tree, but RDF is a graph, and you can represent the same rdf graph with different rdfxml. You should view the xml as just an interchange format and use an RDF library to create the graph from the XML, then work with the graph directly. – Francis Avila Dec 07 '11 at 02:27
  • Thanks for your suggestion. But in this work, I just want to extract nodes with some information and then format it for use in a spreadsheet. – Tg. Dec 07 '11 at 04:18

3 Answers

3

Prefix the attribute name with @ in the XPath query:

>>> from lxml import etree
>>> xml = b"""\
... <?xml version="1.0" encoding="utf8"?>
... <rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
...     <rdf:li rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>
... </rdf:RDF>
... """
>>> tree = etree.fromstring(xml)
>>> ns = {'rdf': 'http://www.w3.org/1999/02/22-rdf-syntax-ns#'}
>>> tree.xpath('//rdf:li/@rdf:resource', namespaces=ns)
['urn:miriam:obo.chebi:CHEBI%3A37671']
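
The question notes that the node in hand is not the root. A minimal sketch of the difference, assuming a made-up two-bag document (the `urn:example:...` values are hypothetical, not taken from the yeast file): an expression that begins with `//` always searches from the document root, even when `xpath()` is called on a nested element, while `.//` restricts the search to the subtree of the element it is called on.

```python
from lxml import etree

ns = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

# Hypothetical sample data with two sibling containers.
xml = b"""<rdf:RDF xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
  <rdf:Bag><rdf:li rdf:resource="urn:example:first"/></rdf:Bag>
  <rdf:Bag><rdf:li rdf:resource="urn:example:second"/></rdf:Bag>
</rdf:RDF>"""

root = etree.fromstring(xml)
second_bag = root[1]  # a nested element, not the root

# Leading "//": evaluated from the document root, so both items match.
absolute = second_bag.xpath("//rdf:li/@rdf:resource", namespaces=ns)

# Leading ".//": evaluated from second_bag, so only its own item matches.
relative = second_bag.xpath(".//rdf:li/@rdf:resource", namespaces=ns)

print(absolute)
print(relative)
```

Here `absolute` holds both resource values and `relative` only the second one, so starting from a non-root node is not an obstacle as long as the expression is relative.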

EDIT

Here's a revised version of the script in the question:

import lxml.etree as etree

ns = {
    'sbml': 'http://www.sbml.org/sbml/level2/version4',
    'rdf':'http://www.w3.org/1999/02/22-rdf-syntax-ns#',
    'body':'http://www.w3.org/1999/xhtml',
    're': 'http://exslt.org/regular-expressions',
    }

def extract_name_and_chebi(node):
    chebies = node.xpath("""
        .//rdf:li[
        starts-with(@rdf:resource, 'urn:miriam:obo.chebi')
        ]/@rdf:resource
        """, namespaces=ns)
    return node.attrib['name'], chebies[0]

with open('yeast_4.02.xml', 'rb') as xml:
    tree = etree.parse(xml)

    maybemeta = tree.xpath("""
        //sbml:speciesType[descendant::rdf:li[
        starts-with(@rdf:resource, 'urn:miriam:obo.chebi')]]
        """, namespaces = ns)

    with open('metabolites.txt', 'w') as output:
        for node in maybemeta:
            output.write('%s\t%s\n' % extract_name_and_chebi(node))
ekhumoro
  • I forgot to say that this is not a root node, so I don't think "//" will work here – Tg. Dec 07 '11 at 04:19
  • @Tg. I don't understand your comment or the edit you've added to your question. What is the structure of your xml file? And what code are you currently using to parse and query it? If you posted a small, working example script like the one in my answer, it would be a lot easier for people to provide more useful answers. – ekhumoro Dec 07 '11 at 17:08
  • @Tg. I've updated my answer with a revised version of the script in your question. It produces exactly the same output. – ekhumoro Dec 09 '11 at 21:14
1

To select the attribute named rdf:resource of the current node, use this XPath expression:

@rdf:resource

For this to work correctly, you must register the association between the prefix "rdf" and the corresponding namespace.

If you don't know how to register the rdf namespace, it is still possible to select the attribute with this XPath expression:

@*[name()='rdf:resource']
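
A rough sketch of how both expressions might be evaluated from lxml, assuming the single `rdf:li` element shown in the question is the current node. The third query is an addition not taken from the answer: it matches on the namespace URI rather than the prefix, so it does not depend on which prefix the document happens to use. The variable name `uri` is illustrative; lxml accepts XPath variables as keyword arguments to `xpath()`.

```python
from lxml import etree

RDF_NS = "http://www.w3.org/1999/02/22-rdf-syntax-ns#"

# The element from the question, parsed on its own so it is the context node.
node = etree.fromstring(
    b'<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    b'rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>'
)

# 1. Prefix registered through the namespaces mapping.
with_prefix = node.xpath("@rdf:resource", namespaces={"rdf": RDF_NS})

# 2. No registration: compare against the qualified name as written in the
#    document. This breaks if the document uses a different prefix.
by_qname = node.xpath("@*[name()='rdf:resource']")

# 3. No registration, prefix-independent: compare local name and namespace URI.
by_uri = node.xpath(
    "@*[local-name()='resource' and namespace-uri()=$uri]", uri=RDF_NS
)

print(with_prefix, by_qname, by_uri)
```

All three queries return a one-element list containing the attribute value.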
Dimitre Novatchev
0

Well, I got it. The XPath expression I need here is "./@rdf:resource", not ".@rdf:resource". But why? I thought "./" indicated a child of the current node.
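
One way to read this, offered as a sketch rather than a definitive explanation: "." is a complete step meaning the current node, "/" separates steps, and the next step uses the child axis only when no axis is given. The "@" is shorthand for the attribute axis, so "./@rdf:resource" expands to "self::node()/attribute::rdf:resource". ".@rdf:resource" has no step separator between "." and "@", so it does not fit the XPath grammar. The snippet below, which assumes the `rdf:li` element from the question, checks that the abbreviated, expanded, and bare forms select the same attribute.

```python
from lxml import etree

ns = {"rdf": "http://www.w3.org/1999/02/22-rdf-syntax-ns#"}

node = etree.fromstring(
    b'<rdf:li xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" '
    b'rdf:resource="urn:miriam:obo.chebi:CHEBI%3A37671"/>'
)

# Abbreviated form: current node, then the attribute axis.
abbreviated = node.xpath("./@rdf:resource", namespaces=ns)

# The same expression with every axis spelled out.
expanded = node.xpath("self::node()/attribute::rdf:resource", namespaces=ns)

# A relative path already starts at the current node, so "./" is optional.
bare = node.xpath("@rdf:resource", namespaces=ns)

print(abbreviated == expanded == bare)
```

By contrast, "./rdf:li" would look for child elements, because a step without an explicit axis defaults to the child axis.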

Tg.